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Abstract 

This work considers a computationally and statistically efficient parameter estimation method 
for a wide class of latent variable models — including Gaussian mixture models, hidden Markov 
models, and latent Dirichlet allocation — which exploits a certain tensor structure in their low- 
order observable moments (typically, of second- and third-order) . Specifically, parameter estima- 
tion is reduced to the problem of extracting a certain (orthogonal) decomposition of a symmetric 
tensor derived from the moments; this decomposition can be viewed as a natural generalization 
of the singular value decomposition for matrices. Although tensor decompositions are generally 
intractable to compute, the decomposition of these specially structured tensors can be efficiently 
obtained by a variety of approaches, including power iterations and maximization approaches 
(similar to the case of matrices). A detailed analysis of a robust tensor power method is pro- 
vided, establishing an analogue of Wedin's perturbation theorem for the singular vectors of 
matrices. This implies a robust and computationally tractable estimation approach for several 
popular latent variable models. 

1 Introduction 

The method of moments is a classical parameter estimation technique [Pca94j from statistics which 
has proved invaluable in a number of application domains. The basic paradigm is simple and in- 
tuitive: (i) compute certain statistics of the data — often empirical moments such as means and 
correlations — and (ii) find model parameters that give rise to (nearly) the same corresponding 
population quantities. In a number of cases, the method of moments leads to consistent estima- 
tors which can be efficiently computed; this is especially relevant in the context of latent variable 
models, where standard maximum likelihood approaches are typically computationally prohibitive, 
and heuristic methods can be unreliable and difficult to validate with high-dimensional data. Fur- 
thermore, the method of moments can be viewed as complementary to the maximum likelihood 
approach; simply taking a single step of Newton-Ralphson on the likelihood function starting from 
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the moment based estimator |Le 86] often leads to the best of both worlds: a computationally 
efficient estimator that is (asymptotically) statistically optimal. 

The primary difficulty in learning latent variable models is that the latent (hidden) state of 
the data is not directly observed; rather only observed variables correlated with the hidden state 
are observed. As such, it is not evident the method of moments should fare any better than 
maximum likelihood in terms of computational performance: matching the model parameters to 
the observed moments may involve solving computationally intractable systems of multivariate 
polynomial equations. Fortunately, for many classes of latent variable models, there is rich structure 
in low-order moments (typically second- and third-order) which allow for this inverse moment 
problem to be solved efficiently |Cat44l ICar91l ICha96l IMK061 IHKZ121 IAHK121 IAFH+121 IMK12] . 
What is more is that these decomposition problems are often amenable to simple and efficient 
iterative methods, such as gradient descent and the power iteration method. 



1.1 Contributions 

This work observes that a number of important and well-studied latent variable models — including 
Gaussian mixture models, hidden Markov models, and Latent Dirichlet allocation — share a certain 
structure in their low-order moments, and this permits certain tensor decomposition approaches 
to parameter estimation. In particular, this particular tensor decomposition can be viewed as a 
natural generalization of the singular value decomposition for matrices. 

While much of this (or similar) structure was implicit in several previous works [Cha96, MR06, 
HKZ12, AH K121 |AFH + 12"1 IHK12] , here we make the decomposition explicit under a unified frame- 
work. Specifically, we express the observable moments as sums of rank-one terms, and reduce the 
parameter estimation task to the problem of extracting a symmetric orthogonal decomposition of 
symmetric tensor derived from these observable moments. The problem can then be solved by a 
variety of approaches, including fixed-point and variational methods. 

One approach for obtaining the orthogonal decomposition is the tensor power method of [LMVOO, 
Remark 3]. We provide a convergence analysis of this method for orthogonally decomposable sym- 
metric tensors, as well as a detailed perturbation analysis for a robust (and a computationally 
tractable) variant. This perturbation analysis can be viewed as an analogue of Wedin's pertur- 
bation theorem for singular vectors of matrices |Wed72 ] . providing a bound on the error of the 
recovered decomposition in terms of the operator norm of the tensor perturbation. This analysis 
is subtle in at least two ways. First, unlike for matrices (where every matrix has a singular value 
decomposition), an orthogonal decomposition need not exist for the perturbed tensor. Our robust 
variant uses random restarts and deflation to extract an approximate decomposition in a com- 
putationally tractable manner. Second, the analysis of the deflation steps is non-trivial; a naive 
argument would entail error accumulation in each deflation step, which we show can in fact be 
avoided. When this method is applied for parameter estimation in latent variable models previ- 
ously discussed, improved sample complexity bounds (over previous work) can be obtained using 
this perturbation analysis. 

Finally, we also address computational issues that arise when applying the tensor decomposition 
approaches to estimating latent variable models. Specifically, we show that the basic operations of 
simple iterative approaches (such as the tensor power method) can be efficiently executed in time 
linear in the dimension of the observations and the size of the training data. For instance, in a 
topic modeling application, the proposed methods require time linear in the number of words in the 
vocabulary and in the number of non-zero entries of the term-document matrix. The combination 
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of this computational efficiency and the robustness of the tensor decomposition techniques makes 
the overall framework a promising approach to parameter estimation for latent variable models. 



1.2 Related work 
Tensor decompositions 

The role of tensor decompositions in the context of latent variable models dates back to early 
uses in psychometrics |Cat44| . These ideas later gained popularity in chemometrics, and more 
recently in numerous science and engineering disciplines, including neuroscience, phylogenetics, 
signal processing, data mining, and computer vision. A thorough survey of these techniques and 
applications can be found in [KB09]. Below, we discuss a few specific connections to two applications 
in machine learning and statistics: independent component analysis and latent variable models. 

Tensor decompositions have been used in signal processing and computational neuroscience 
for blind source separation and independent component analysis (ICA) |CJ10j . Here, statistically 
independent non-Gaussian sources are linearly mixed in the observed signal, and the goal is to 
recover the mixing matrix (and ultimately, the original source signals). A typical solution is to 
locate projections of the observed signals that correspond to local extrema of the so-called "contrast 
functions" which distinguish Gaussian variables from non-Gaussian variables. This method can be 
effectively implemented using fast descent algorithms |Hyv99|. When using the excess kurtosis (i.e., 
fourth-order cumulant) as the contrast function, this method reduces to a generalization of the 
power method for symmetric tensors [LMVOO j IZG01|, lKR02j . This case is particularly important, 
since all local extrema of the kurtosis objective correspond to the true sources (under the assumed 
statistical model) [DL95 ; the descent methods can therefore be rigorously analyzed, and their 
computational and statistical complexity can be bounded [FJK96, N R09} IAGMS121 IHK12| . 

Another line of works views a tensor decomposition as a simultaneous diagonalization of a 
collection of matrices obtained from a tensor. This was employed for parameter estimation of dis- 
crete Markov models [Cha96] using pair-wise and triple-wise probability tables. This idea has been 
extended to other latent variable models such as hidden Markov models (HMMs), latent trees, Gaus- 
sian mixture models, and topic models such as latent Dirichlet allocation (LDA) [MR06, HKZ12, 
AH KT21 lAFH+121 IHK12] . Simult aneous diagonalization is also used for many other applications, 



including blind source separation and ICA (as discussed above), and a number of efficient algo- 
rithms have been developed for this problem [BGBM93L IU5931 IGar94l ICC961 IGGT97L IZLJNM04] . 
Another reduction from tensors to matrices called flattening has also been used to solve tensor de- 
composition problems via matrix eigenvalue techniques [Car91| [DLCC 07j . One advantage of these 
methods is that they can be used to estimate under-determined mixtures, where the number of 
sources is larger than the observed dimension. 

The relevance of tensor analysis to latent variable modeling has been long recognized in the field 
of algebraic statistics |PS05| , and many works characterize the algebraic varieties corresponding to 
the moments of various classes of latent variable models [DSS071 ISZ11| . These works typically do 
not address computational or finite sample issues, but rather are concerned with basic questions of 
identifiability. 

The specific tensor structure considered in the present work is the symmetric orthogonal de- 
composition. This decomposition expresses a tensor as a linear combination of simple tensor forms; 
each form is the tensor product of a vector (i.e., a rank-1 tensor), and the collection of vectors form 
an orthonormal basis. An important property of tensors with such decompositions is that they 
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have eigenvectors corresponding to these basis vectors. Although the concepts of eigenvalues and 
eigenvectors of tensors is generally significantly more complicated than their matrix counterpart 
(both algebraically |IQi05, ICSlll ILim05] and computationally |HL12l |KR02| ). the special symmet- 



ric orthogonal structure we consider permits simple algorithms to efficiently and stably recover the 
desired decomposition. In particular, a generalization of the matrix power method to symmetric 
tensors, introduced in [LMVOOl Remark 3] and analyzed in jKR02] . provides such a decomposition. 
This is in fact implied by the characterization in [ZG01], which shows that iteratively obtaining 
the best rank-1 approximation of such orthogonally decomposable tensors also yields the exact 
decomposition. We note that in general, obtaining such approximations for general (symmetric) 
tensors is NP-hard |HL12j . 

Latent variable models 

This work focuses on the particular application of tensor decomposition methods to estimating 
latent variable models, a significant departure from many previous approaches in the machine 
learning and statistics literature. By far the most popular heuristic for parameter estimation for 
such models is the Expectation-Maximization (EM) algorithm [DLR77, RW84J. Although EM has 
a number of merits, it may suffer from slow convergence and poor quality local optima [RW84J, 
requiring practitioners to employ many additional heuristics to obtain good solutions. For some 
models such as latent trees [Roc06j and topic models [AGM12J, maximum likelihood estimation is 
NP-hard, which suggests that other estimation approaches may be more attractive. More recently, 
algorithms from theoretical computer science and machine learning have addressed computational 
and sample complexity issues related to estimating certain latent variable models such as Gaussian 
mixture models and HMMs [Das99l IAK0H IDS071 IVW021 IKSV051 [A"M05l IGRMI lBV()8l IKMV101 
IBSTUI IM V101 IHK121 IGha961 IMR061 IHKZ121 IXHKT21 IAGM121 lAFH+12] . See |AHK12llHKT2] for a 



discussion of these methods, together with the computational and statistical hardness barriers that 
they face. The present work reviews a broad range of latent variables where a mild non-degeneracy 
condition implies the symmetric orthogonal decomposition structure in the tensors of low-order 
observable moments. 

Notably, another class of methods, based on subspace identification [OM96J and observable 
operator models/multiplicity automata [Sch61~ | iJaeOO} iLSSOlj. have been proposed for a number 
of latent variable models. These methods were successfully developed for HMMs in [HKZ12J, and 
subsequently generalized and extended for a number of related sequential and tree Markov models 
models |SBG10l IBailll IBSGlll iPSXTTl IFRU121 |BQG12t IBM12] , as well as certain classes of parse 



tree models |LQBC12], ICSC + 12"1 |DRC + 12] . These methods use low-order moments to learn an 
"operator" representation of the distribution, which can be used for density estimation and belief 
state updates. While finite sample bounds can be given to establish the learnability of these 
models [HKZ12J, the algorithms do not actually give parameter estimates (e.g., of the emission or 
transition matrices in the case of HMMs) . 

1.3 Organization 

The rest of the paper is organized as follows. Section [2] reviews some basic definitions of tensors. 
Section [3] provides examples of a number of latent variable models which, after appropriate manip- 
ulations of their low order moments, share a certain natural tensor structure. Section [4] reduces 
the problem of parameter estimation to that of extracting a certain (symmetric orthogonal) de- 
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composition of a tensor. We then provide a detailed analysis of a robust tensor power method 
and establish an analogue of Wedin's perturbation theorem for the singular vectors of matrices. 
The discussion in Section [6] addresses a number of practical concerns that arise when dealing with 
moment matrices and tensors. 

2 Preliminaries 

A real p-th order tensor A G ®f=i is a member of the tensor product of Euclidean spaces 
M Ui , i G [p]. We generally restrict to the case where n\ = n<i = • • ■ = n p = n, and simply write 
A G (g) p K n . For a vector w e K n , we use v m := v ® v ® ■ ■ ■ ® v E (g) p W 1 to denote its p-th tensor 
power. As is the case for vectors (where p = 1) and matrices (where p = 2), we may identify a p-th 
order tensor with the p-way array of real numbers [A-ix,i2,—,i P '■ h,i2, • • • ,i p G [n]], where Ai li i 2> __, t i p 
is the (ii, • ■ ■ , i p )-th coordinate of A (with respect to a canonical basis). 

We can consider A to be a multilinear map in the following sense: for a set of matrices {Vi G 
jgmxm; . ^ ^ [p]} , the (ii,i2, • • • , i p )-th. entry in the p-way array representation of A(Vi, V2, . . . , V^,) £ 

jgmiXm2X-xm p -g 

[A(Vi,V2,...,Vp)]i 1) i 2! ... t i p := ^2 A ji,j2,---,j P [ v l]ji,h i V 2]h,i2 [ V p\j P ,i P - 

ji,j2,-,j P e[n] 

Note that if A is a matrix (p = 2), then 

A(Vi,^ 2 ) = ViAV 2 . 
Similarly, for a matrix A and vector v G R n , we can express as 

A(I,v) = Av G M n , 

where / is the n x n identity matrix. As a final example of this notation, observe 

e i2> • • • ) e i P ) = -^-11,12,— ,ipi 

where {e\, e2, . . . , e n } is the canonical basis for W n . 

Most tensors A G (££) p IR n considered in this work will be symmetric (sometimes called super- 
symmetric), which means that their p-way array representations are invariant to permutations of 
the array indices: i.e., for all indices i x , i 2 , . ■ . , i p G [n], A ilti2t _ tip = ^ {1) ,v (2) ,...,v (p) for any permu- 
tation 7r on \p\. It can be checked that this reduces to the usual definition of a symmetric matrix 
for p = 2. 

The rank of a p-th order tensor A G W 1 is the smallest non-negative integer k such that 
A = Y2j=i u i,j ® u 2,j ® • • • <8) for some Ujj G M n ,i G [p],j G [fc], and the symmetric rank of a 
symmetric p-th order tensor A is the smallest non-negative integer k such that A = Y2j=i u f P ^ OT 
some Uj G M n ,j G [&]. The notion of rank readily reduces to the usual definition of matrix rank 
when p = 2, as revealed by the singular value decomposition. Similarly, for symmetric matrices, 
the symmetric rank is equivalent to the matrix rank as given by the spectral theorem. 

The notion of tensor (symmetric) rank is considerably more delicate than matrix (symmetric) 
rank. For instance, it is not clear a priori that the symmetric rank of a tensor should even be 
finite [CGLM08J. In addition, removal of the best rank-1 approximation of a (general) tensor may 
increase the tensor rank of the residual |SC10| . 
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Throughout, we use \\v\\ = (Yli^) 1 ^ 2 to denote the Euclidean norm of a vector v, and ||M|| to 
denote the spectral (operator) norm of a matrix. We also use ||T|| to denote the operator norm of 
a tensor, which we define later. 

3 Tensor structure in latent variable models 

In this section, we give several examples of latent variable models whose low-order moments can 



be written as symmetric tensors of low symmetric rank. This form is demonstrated in Theorem 3.1 
for the first example. The general pattern will emerge from subsequent examples. 

3.1 Exchangeable single topic models 

We first consider a simple bag-of-words model for documents in which the words in the document 
are assumed to be exchangeable. Recall that a collection of random variables x±, x 2 , . . . , X£ are 
exchangeable if their joint probability distribution is invariant to permutation of the indices. The 
well-known De Finetti's theorem |Aus08j implies that such exchangeable models can be viewed 
as mixture models in which there is a latent variable h such that x±,X2, ■ ■ ■ ,xe are conditionally 



i.i.d. given h (see Figure 1(a) for the corresponding graphical model) and the conditional distribu- 
tions are identical at all the nodes. 

In our simplified topic model for documents, the latent variable h is interpreted as the (sole) 
topic of a given document, and it is assumed to take only a finite number of distinct values. Let k 
be the number of distinct topics in the corpus, d be the number of distinct words in the vocabulary, 
and t > 3 be the number of words in each document. The generative process for a document is 
as follows: the document's topic is drawn according to the discrete distribution specified by the 
probability vector w := (w\, w 2 , ■ ■ ■ , Wf.) £ A fc_1 . This is modeled as a discrete random variable h 
such that 

Pr[h = j] = Wj, j £ [k]. 

Given the topic h, the document's £ words are drawn independently according to the discrete 
distribution specified by the probability vector [i^ £ A d_1 . It will be convenient to represent the £ 
words in the document by d-dimensional random vectors x\,x 2 , . . . ,x? € M d . Specifically, we set 

Xt = e{ if and only if the i-th word in the document is i, t £ [£], 

where e\, e^i ■ ■ • e<j is the standard coordinate basis for U. d . 

One advantage of this encoding of words is that the (cross) moments of these random vectors 
correspond to joint probabilities over words. For instance, observe that 

E[x\ (g> x 2 ] = Pr[xi = e«, x 2 = ej) &i ® Cj 

l<i,j<d 

= Pr[lst word = i, 2nd word = j] ej (g> ej, 

l<i,j<d 

so the (i,j)-the entry of the matrix E[xi (g> x 2 ] is Pr[lst word = i, 2nd word = j]. More generally, 
the (ii,i 2 , ■ ■ ■ ,ii)-th. entry in the tensor K[xi ® x 2 <g) • • • (8) x^\ is Pr[lst word = «i,2nd word = 
i 2 , . . . , £-th word = This means that estimating cross moments, say, of x\ <g> x 2 <g> X3, is the same 
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as estimating joint probabilities of the first three words over all documents. (Recall that we assume 
that each document has at least three words.) 

The second advantage of the vector encoding of words is that the conditional expectation of xt 
given h = j is simply fj,j, the vector of word probabilities for topic j: 



d 



E[x t \h = j] = Pr[t-th word = i\h = j] &j = y~]\p.j]i e, = Hj, j G [k] 
i=i i=i 

(where \pj]i is the i-th entry in the vector Hj)- Because the words are conditionally independent 
given the topic, we can use this same property with conditional cross moments, say, of x\ and x 2 : 

E[xi <g> X2\h = j] = E[x\\h = j] (g> E[x 2 \h = j] = Hj^Hi j e [k]. 

This and similar calculations lead one to the following theorem. 

Theorem 3.1 ([ AHK12] ). If 

M 2 := E[xi ® x 2 ] 

M 3 := E[xi ® x 2 <S> x 3 ], 



then 



k 

M 2 = m 



i=i 

k 



M 3 = y^Wi Hi® Hi® Hi- 



i=l 



As we will see in Section 4.3, the structure of M 2 and M 3 revealed in Theorem 3.1 implies that 
the topic vectors Hli M2> • • • j can be estimated by computing a certain symmetric tensor decom- 
position. Moreover, due to exchangeability, all triples (resp., pairs) of words in a document — and 



not just the first three (resp., two) words — can be used in forming M 3 (resp., M 2 ); see Section 6.1 



3.2 Beyond raw moments 

In the single topic model above, the raw (cross) moments of the observed words directly yield 
the desired symmetric tensor structure. In some other models, the raw moments do not explicitly 
have this form. Here, we show that the desired tensor structure can be found through various 
manipulations of different moments. 



Spherical Gaussian mixtures 

We now consider a mixture of k Gaussian distributions with spherical covariances. We start with 
the simpler case where all of the covariances are identical; this probabilistic model is closely related 
to the (non-probabilistic) /c- means clustering problem [Mac67] . We then consider the case where 
the spherical variances may differ. 
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Common covariance. Let Wi be the probability of choosing component iG [k], {hi, H2i . . . , Hk} C 
M. d be the component mean vectors, and a 2 I be the common covariance matrix. An observation in 
this model is given by 

x := n h + z, 

where h is the discrete random variable with Pr[/i = i] = Wi for i £ [k] (similar to the exchangeable 
single topic model), and z ~ A/"(0, a 2 1) is an independent multivariate Gaussian random vector in 
M. d with zero mean and spherical covariance a 2 I. 

The Gaussian mixture model differs from the exchangeable single topic model in the way obser- 
vations are generated. In the single topic model, we observe multiple draws (words in a particular 
document) xi,X2, ■ ■ ■ ,%i given the same fixed h (the topic of the document). In contrast, for the 
Gaussian mixture model, every realization of x corresponds to a different realization of h. 

Theorem 3.2 QHK12]). Assume d>k. The variance a 2 is the smallest eigenvalue of the covari- 
ance matrix E[x (8 x] — E[x] (8 E[x]. Furthermore, if 

M 2 := E[x (8 x] - cr 2 I 

d 

M 3 := E[x ® x (g> x] - a 2 ^ (E[x] (8 e; <8 e, + e* <8 E[x] (8) + e» <8 ej <8 E[x]) , 

i=l 

f/jen 

fc 

M 2 = /ij (8) /ii 

i=l 

fc 

M 3 = (ii® Hi. 

i=i 

Differing covariances. The general case is where each component may have a different spherical 
covariance. An observation in this model is again x = Hh + z , but now z £ M rf is a random vector 
whose conditional distribution given h = i (for some i G [&]) is a multivariate Gaussian J\f(0,o~fl) 
with zero mean and spherical covariance cr 2 I. 

Theorem 3.3 QHK12]). Assume d > k. The average variance a 2 := ^2i = \WiO~i is the smallest 
eigenvalue of the covariance matrix E[x <8 x] — E[x] (8) E[x]. £e£ t> 6e any imii norm eigenvector 
corresponding to the eigenvalue a 2 . If 

Mi := E[x(v T (x-E[x})) 2 } 
M 2 := E[x <8 x] - a 2 / 

M 3 := E[x (8 x (8 x\ - V] (Mi (864(864 + e* (8 M x (8 ej + (8 ej (8 Mi) , 

i=l 

i/ien 

fe 

M 2 = ^ itfj /Xj (8 //j 
i=l 
fe 

M 3 = ^2 Wi ^ <8 Hi ® 
i=l 
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As shown in [HK12J, Mi = s ^l=\ w i a 'll JL i- Note that for the common covariance case, where 



af = a 2 , we have that Mi = a 2 E[x] (cf. Theorem 3.2) 



Independent component analysis (ICA) 

The standard model for ICA |Com94HCC9 6, HO00, CJ10J, in which independent signals are linearly 
mixed and corrupted with Gaussian noise before being observed, is specified as follows. Let /jGM* 
be a latent random vector with independent coordinates, A G M rfxfc the mixing matrix, and z be a 
multivariate Gaussian random vector. The random vectors h and z are assumed to be independent. 
The observed random vector is 

x := Ah + z. 

Let m denote the i-th column of the mixing matrix A. 
Theorem 3.4 ( [CJIOj ). Define 

M A := E[x ®x®x®x]—T 

where T is the fourth-order tensor with 

[ T \h,i2MM ■= ^[xi 1 Xi 2 ]E[xi 3 x i4 \ + E[x il Xi z ]E[xi 2 x ii } +E[x il Xi 4 ]E[x i2 x i3 }, 1 < ii,i 2 ,«3,*4 < k 

(i.e., T is the fourth derivative tensor of the function v i— > 8~ 1 E[(v T x) 2 ] 2 . Let Ki := E[hf] — 3 for 
each i G [k]. Then 



M 4 = y~] Kj jij ® fj,j g) fjij <8) 



Sec [HK12J for a proof of this theorem in this form. Note that corresponds to the excess 
kurtosis, a measure of non-Gaussianity as Kj = if hi is a standard normal random variable. 
Furthermore, note that A is not identifiable if h is a multivariate Gaussian. 



We may derive forms similar to that of M 2 and M3 from Theorem 3.1 using M4 by observing 
that 



M 4 (I,I,u,v) = ^2Ki(fiJu)(fiJv) Hi® 
i=i 

k 

m 4 (j, I, i,v) = y j K-iinJv) Hi® ^® 



1=1 

for any vectors u,v £ W 1 . 

Latent Dirichlet Allocation (LDA) 

An increasingly popular class of latent variable models are mixed membership models, where each 
datum may belong to several different latent classes simultaneously. LDA is one such model for 
the case of document modeling; here, each document corresponds to a mixture over topics (as 
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opposed to just a single topic). The distribution over such topic mixtures is a Dirichlet distribution 
Dir(a) with parameter vector a G with strictly positive entries; its density over the probability 
simplex A fe_1 := {v G R k : v { G [0, l]Vi G [k], Yh=i v i = 1} is § iven b y 

where 

ao : = «i + "2 H + «fc- 

As before, the A; topics are specified by probability vectors Ml>M2j • ••>/■**; G A d_1 . To generate 
a document, we first draw the topic mixture h = (hi, h2, ■ ■ ■ , hk) ~ Dir(a), and then conditioned 
on h, we draw t words xi, x 2 , ■ ■ ■ , xg independently from the discrete distribution specified by the 
probability vector Y2i=i hitH for each x t , we independently sample a topic j according to h 

and then sample xt according to fjtj). Again, we encode a word Xt by setting xt = &i iff the i-th 
word in the document is i. 

The parameter ao (the sum of the "pseudo-counts") characterizes the concentration of the 
distribution. As ao — > 0, the distribution degenerates to a single topic model (i.e., the limiting 
density has, with probability 1, exactly one entry of h being 1 and the rest are 0). At the other 
extreme, if a = (c, c, . . . , c) for some scalar c > 0, then as ao = ck — > oo, the distribution of 
h becomes peaked around the uniform vector (l/k,l/k, . . . ,1/k) (furthermore, the distribution 
behaves like a product distribution). We are typically interested in the case where ao is small (e.g., 
a constant independent of k), whereupon h typically has only a few large entries. This corresponds 
to the setting where the documents are mainly comprised of just a few topics. 

Theorem 3.5 ( |AFH+12j ). Define 
Mi := E[xi] 

Oin 

M 2 := Eh <g> x 2 ] —Mi <g> Mi 

a + 1 

M 3 := E[xi ® x 2 x 3 ) 

an / 

' E[xi <g> x 2 <8> Mi] + E[xi <g> Mi <g> x 2 ] + E[Mi 0xi® x 2 ] 



Then 



a + 2 

2a^ 

(a + 2)(a + 1) 



M 2 = > t — t — m <8> im 

f-f (a + l)a 
i=i 

M 3 = £ 



e-f (a + 2)(a + l)a 

4 = 1 



Mi <&> Mi <&> Mi 



Note that ao needs to be known to form M 2 and M3 from the raw moments. This, however, 
is a much weaker than assuming that the entire distribution of h is known (i.e., knowledge of the 
whole parameter vector a). 
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(a) Multi-view models 



(b) Hidden Markov model 



Figure 1: Examples of latent variable models. 
3.3 Multi-view models 

Multi-view models (also sometimes called naive Bayes models) are a special class of Bayesian 
networks in which observed variables xi,X2, ■ ■ ■ ,xg are conditionally independent given a latent 
variable h. This is similar to the exchangeable single topic model, but here we do not require 
the conditional distributions of the xt,t G [£] to be identical. Techniques developed for this class 
can be used to handle a number of widely used models including hidden Markov models (HMMs) 
|MR06l IAHK12j . phylogenetic tree models |Cha96l IMR06] . certain tree mixtures |AHHK12j . and 
certain probabilistic grammar models [HKL12]. 

As before, we let h E [k] be a discrete random variable with Pr[/i = j] = Wj for all j E [k]. Now 
consider random vectors x\ E M. dl , X2 E M. d2 , and X3 E M. d3 which are conditionally independent 
given h, and 

E[x t \h = j]= f J 4 , j , j € [*], t€ {1,2,3} 

where the fxtj G R * are the conditional means of the x% given h = j. Thus, we allow the observations 
xi,X2, ■ ■ ■ ,X£ to be random vectors, parameterized only by their conditional means. Importantly, 
these conditional distributions may be discrete, continuous, or even a mix of both. 
We first note the form for the raw (cross) moments. 

Proposition 3.1. We have that: 

k 

E[x t ®x t >] = fj, t ',i, {t,t'} C {1,2,3}, t^t' 

i=l 
k 

E[xi 0x 2 0x 3 ] = Wi [i^i fi 2 ,i M3,i- 
1=1 

The cross moments do not possess a symmetric tensor form when the conditional distributions 
are different. Nevertheless, the moments can be "symmetrized" via a simple linear transformation 
of x\ and X2 (roughly speaking, this relates x\ and X2 to X3); this leads to an expression from which 
the conditional means of X3 (i.e., /i3,i,/i3,2, • • • , ^3,k) can be recovered. For simplicity, we assume 
d\ = d,2 = d>3 = k; the general case (with dt > A;) is easily handled using low-rank singular value 
decompositions. 

Theorem 3.6 f jAFH+12 ]). A ssume that the vectors {n v ,i, ^v,2, ■ ■ ■ ; l^v,k} are linearly independent 
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for each v G {1, 2,3}. Define 



x 2 
M 2 
M 3 



Then 



E[x 3 <g> x 2 ]E[xi <g> x 2 ]~ 1 x\ 
E[x 3 ® xi]E[x 2 <g> ^i]~ 1 x 2 
E[xi ® x 2 ] 
E[£i 0X2® ^3]- 



M 2 = y^Wj fj, 3ti <g> /U 3) i 
i=l 
fc 

M 3 = ^ itfj /i 3j * ® /i 3)i ® )U 3j j. 

1=1 

We now discuss three examples (taken mostly from [AHK12!) where the above observations 
can be applied. The first two concern mixtures of product distributions, and the last one is the 
time-homogeneous hidden Markov model. 



Mixtures of axis-aligned Gaussians and other product distributions 

The first example is a mixture of k product distributions in M n under a mild incoherence assump- 
tion [AHK12 . Here, we allow each of the k component distributions to have a different product 
distribution (e.g., Gaussian distribution with an axis-aligned covariance matrix), but require the 
matrix of component means A := [//i|/x 2 | • • • \fik] £ M nxfc to satisfy a certain (very mild) incoherence 
condition. The role of the incoherence condition is explained below. 

For a mixture of product distributions, any partitioning of the dimensions [n] into three groups 
creates three (possibly asymmetric) "views" which are conditionally independent once the mixture 



component is selected. However, recall that Theorem 3.6 requires that for each view, the k condi- 



tional means be linearly independent. In general, this may not be achievable; consider, for instance, 
the case \ii = e, for each i G [k]. Such cases, where the component means are very aligned with the 
coordinate basis, are precluded by the incoherence condition. 

Define coherence(A) := maxj g r„i{ejn j 4e.j} to be the largest diagonal entry of the orthogonal 
projector to the range of A, and assume A has rank k. The coherence lies between k/n and 1; it 
is largest when the range of A is spanned by the coordinate axes, and it is k/n when the range is 
spanned by a subset of the Hadamard basis of cardinality k. The incoherence condition requires, 
for some e, S G (0, 1), 

coherence(A) < ] -^. 

Essentially, this condition ensures that the non- degeneracy of the component means is not isolated 
in just a few of the n dimensions. Operationally, it implies the following. 

Proposition 3.2 QAHK12]). Assume A has rank k and coherence(il) < (e 2 /6)/ln(3/c/5) for some 
£, 8 G (0, 1). With probability at least 1 — 5, a random partitioning of the dimensions [n] into three 
groups (for each i G [n], independently pick t G {1, 2, 3} uniformly at random and put i in group t) 
has the following property. For each t G {1,2,3} and j G [k], let [i t ,j be the entries of /ij put into 
group t, and let At := [/^t,i|^, 2 | ■ ■ ■ |A*tfc]- Then for each t G {1,2,3}, At has full column rank, and 
the k-th largest singular value of At is at least — e)/3 times that of A. 
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Therefore, three asymmetric views can be created by randomly partitioning the observed ran- 
dom vector x into x\, X2, and xs, such that the resulting component means for each view satisfy 
the conditions of Theorem 13.61 



Spherical Gaussian mixtures, revisited 



Consider again the case of spherical Gaussian mixtures (cf. Section 3.2). As we shall see in Sec- 



tion 4.3, the previous techniques (based on Theorem |3.2| and Theorem 3.3) lead to estimation 
procedures when the dimension of x is k or greater (and when the k component means are linearly 
independent). We now show that when the dimension is slightly larger, say greater than 3k, a 
different (and simpler) technique based on the multi-view structure can be used to extract the 
relevant structure. 

We again use a randomized reduction. Specifically, we create three views by (i) applying a 
random rotation to x, and then (ii) partitioning x G W 1 into three views xi,X2,X3 G M. d for 
d := re/3. By the rotational invariance of the multivariate Gaussian distribution, the distribution 
of x after random rotation is still a mixture of spherical Gaussians (i.e., a mixture of product 
distributions), and thus xi,X2,X3 are conditionally independent given h. What remains to be 
checked is that, for each view t G {1,2,3}, the matrix of conditional means of xt for each view 
has full column rank. This is true with probability 1 as long as the matrix of conditional means 
A := | /j,2 1 • • • \k L k] G M nxk has rank k and re > 3fc. To see this, observe that a random rotation 
in M. n followed by a restriction to d coordinates is simply a random projection from 1" to W 1 , and 
that a random projection of a linear subspace of dimension k to M. d is almost surely injective as 
long as d > k. Applying this observation to the range of A implies the following. 

Proposition 3.3 ([HK12]). Assume A has rank k and that re > 3k. Let R G IR nxn be chosen 
uniformly at random among all orthogonal n x n matrices, and set x := Rx G W 1 and A := RA = 
[Rfii\R[i2 \ • • • \RfJ-k] G M. nxk . Partition [re] into three groups of sizes d\, d2,d% with dt > k for each 
t G {1,2,3}. Furthermore, for each t, define xt G M rft (respectively, At G M. dtXk ) to be the subvector 
of x (resp., submatrix of A) obtained by selecting the dt entries (resp., rows) in the t-th group. 
Then x\,X2,x^ are conditionally independent given h; K[xt\h = j] = AtCj for each j G [k] and 
t G {1,2,3}; and with probability 1, the matrices Ai,A2,A% have full column rank. 

It is possible to obtain a quantitative bound on the k-th largest singular value of each At in 



terms of the fc-th largest singular value of A (analogous to Proposition 3.2). One avenue is to 



show that a random rotation in fact causes A to have low coherence, after which we can apply 



Proposition 3.2 With this approach, it is sufficient to require n = O(klogk) (for constant e and 
5) , which results in the k-th largest singular value of each At being a constant fraction of the k-th 
largest singular value of A. We conjecture that, in fact, n > c • k for some c > 3 suffices. 

Hidden Markov models 

Our last example is the time-homogeneous HMM for sequences of vector-valued observations 
x\,X2, ■ ■ ■ G R . Consider a Markov chain of discrete hidden states y\ — > y2 — > V3 — > ■ ■ ■ over 
k possible states [k]; given a state yt at time t, the observation xt at time t (a rando m vect or taking 



values in R ) is independent of all other observations and hidden states. See Figure 1(b) 
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Let 7r G A fc 1 be the initial state distribution (i.e., the distribution of yi), and T G M fcxfc be 
the stochastic transition matrix for the hidden state Markov chain: for all times t, 

Prfe/t+i = i\yt = j) = T id , i,j G [k). 

Finally, let O G M. dxk be the matrix whose j-th. column is the conditional expectation of xt given 
yt = j: for all times t, 

E[x t \y t =j] = Oe j , j G [k]. 

Proposition 3.4 ([AHK12J). Define h := y%, where yi is the second hidden state in the Markov 
chain. Then 

• xi,X2,X3 are conditionally independent given h; 

• the distribution of h is given by the vector w := Ttt G A fc_1 ; 

• for all j G [k], 

E[xi\h = j] = Odiag(vr)T T diag(w)" 1 e i 
E[x 2 \h = j] = Oej 
E[x 3 \h = j] = OTej. 

Note the matrix of conditional means of Xt has full column rank, for each t G {1, 2, 3}, provided 
that: (i) O has full column rank, (ii) T is invertible, and (iii) ir and Ttt have positive entries. 

4 Orthogonal tensor decompositions 

We now show how recovering the /Vs in our aforementioned problems reduces to the problem of a 
finding a certain orthogonal tensor decomposition of a symmetric tensor. We start by reviewing the 
spectral decomposition of symmetric matrices, and then discuss a generalization to the higher-order 
tensor case. Finally, we show how orthogonal tensor decompositions can be used for estimating the 
latent variable models from the previous section. 



4.1 Review: the matrix case 

We first build intuition by reviewing the matrix setting, where the desired decomposition is the 
eigendecomposition of a symmetric rank-A: matrix M = VAV T , where V = [v\\v2\ • • • \vk] G W ixk 
is the matrix with orthonormal eigenvectors as columns, and A = diag(Ai, A2, . . . , Xk) G ~R kxk is 
diagonal matrix of non-zero eigenvalues. In other words, 

k k 

M = £>^ T = J>*;f 2 . (1) 
i=i i=i 

Such a decomposition is guaranteed to exist for every symmetric matrix. 

Recovery of the v^s and Aj's can be viewed at least two ways. First, each V{ is fixed under the 
mapping u \-t Mu, up to a scaling factor Xf. 

k 

Mvi = y^Xj(vJvi)vj = XiVi 
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as vjvi = for all j ^ iby orthogonality. The «j's are not necessarily the only such fixed points. For 
instance, with the multiplicity Ai = A2 = A, then any linear combination of v± and V2 is similarly 
fixed under M. However, in this case, the decomposition in ([I]) is not unique, as \iv\vj + \2V2V2 is 
equal to X(u\uJ + ii2uj) for any pair of orthonormal vectors, u\ and U2 spanning the same subspace 
as vi and V2- Nevertheless, the decomposition is unique when Aj, A2, . . . , A& are distinct, whereupon 
the Vj 7 s are the only directions fixed under u *— > Mu up to non-trivial scaling. 

The second view of recovery is via the variational characterization of the eigenvalues. Assume 
Ai > A2 > • • • > Afc; the case of repeated eigenvalues again leads to similar non- uniqueness as 
discussed above. Then the Rayleigh quotient 

u T Mu 

U H-» 

is maximized over non-zero vectors by v\. Furthermore, for any s £ [k], the maximizer of the 
Rayleigh quotient, subject to being orthogonal to vi,V2, ■ ■ ■ ,v s -i, is v s . Another way of obtaining 
this second statement is to consider the deflated Rayleigh quotient 

u i-> — 

and observe that v s is the maximizer. 

Efficient algorithms for finding these matrix decompositions are well studied |GvL961 Section 
8.2.3], and iterative power methods are one effective class of algorithms. 

We remark that in our multilinear tensor notation, we may write the maps u *— > Mu and 
u i->- M T Mu/||n||2 as 

u^-Mu = u\->M(I,u), (2) 

u T Mu M(u,u) , . 

u^f = u^ — ^— L . (3 

U T U U T U 



4.2 The tensor case 

Decomposing general tensors is a delicate issue; tensors may not even have unique decompositions. 
Fortunately, the orthogonal tensors that arise in the aforementioned models have a structure which 
permits a unique decomposition under a mild non-degeneracy condition. We focus our attention to 
the case p = 3, i.e., a third order tensor; the ideas extend to general p with minor modifications. 

An orthogonal decomposition of a symmetric tensor T £ (££) 3 W 1 is a collection of orthonormal 
(unit) vectors {v±, V2, ■ ■ ■ , v^} together with corresponding positive scalars Aj > such that 

k 

T = ^A^f. (4) 

-i=i 

Note that since we are focusing on odd-order tensors (p = 3), we have added the requirement that 
the Aj be positive. This convention can be followed without loss of generality since —Xivf p = 
Aj(— Vi)® p whenever p is odd. Also, it should be noted that orthogonal decompositions do not 
necessarily exist for every symmetric tensors. 

In analogy to the matrix setting, we consider two ways to view this decomposition: a fixed-point 
characterization and a variational characterization. Related characterizations based on optimal 
rank-1 approximations can be found in |ZG01| . 
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Fixed-point characterization 



For a tensor T, consider the vector-valued map 

u i y T(I, u, u) (5) 
which is the third-order generalization of ([2]). This can be explicitly written as 



T(I,u,u)= Tijj(e]u)(e{u)ei. 

l<j,l<d 

Observe that ^ is not a linear map, which is a key difference compared to the matrix case. 

An eigenvector u for a matrix M satisfies M(I, u) = Xu, for some scalar A. We say a unit vector 
u G W 1 is an eigenvector of T, with corresponding eigenvalue A G M, if 

T(I, u, u) = An. 

(To simplify the discussion, we assume throughout that eigenvectors have unit norm; otherwise, 
for scaling reasons, we replace the above equation with T(I,u,u) = X\\u\\u.) For orthogonally 
decomposable tensors T = Y2i=i ^i v f 3 > 

k 

T(I,u,u) = S ^\i(u T v i ) 2 Vi . 

i=l 

By the orthogonality of the V{, it is clear that T{I ,Vi,Vi) = XiVi for all i G [k]. Therefore each 
(vi, Xi) is an eigenvector /eigenvalue pair. 

There are a number of subtle differences compared to the matrix case that arise as a result 
of the non-linearity of ([5]). First, even with the multiplicity Ai = A2 = A, a linear combination 
u := C\v 1 + C2V2 may not be an eigenvector. In particular, 

T(I,u,u) = Aic^fi + X2C2V2 = X(c\v\ + c\v2) 

may not be a multiple of c\V\ + C2t>2- This indicates that the issue of repeated eigenvalues does not 
have the same status as in the matrix case. Second, even if all the eigenvalues are distinct, it turns 
out that the v^s are not the only eigenvectors. For example, set u := {1/Xi)v\ + (1/A2)t>2- Then, 

T(I, u, u) = Ai(l/Ai) V + A 2 (l/A 2 ) 2 i;2 = «, 

so n/||u|| is an eigenvector. More generally, for any subset S C [k], we have that ^ieS^/^*)^ i s 
(proportional to) an eigenvector. 

As we now see, these additional eigenvectors can be viewed as spurious. We say a unit vector 
u is a robust eigenvector of T if there exists an e > such that for all 8 G {u' G M n : \\u' — u\\ < e}, 
repeated iteration of the map 

starting from 9 converges to u. Note that the map ^ rescales the output to have unit Euclidean 
norm. Robust eigenvectors are also called attracting fixed points of ^ (see, e.g., |KM11] ). 

The following theorem implies that if T has an orthogonal decomposition as given in @, then 
the set of robust eigenvectors of T are precisely the set {i>i, i>2, . . . vt}, implying that the orthogonal 
decomposition is unique. (For even order tensors, the uniqueness is true up to sign-flips of the t> j.) 
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Theorem 4.1. Let T have an orthogonal decomposition as given in Q. 

1. The set of 9 G W 1 which do not converge to some Vi under repeated iteration of Q has 



measure zero. 



2. The set of robust eigenvectors of T is equal to {v\,V2, • • • -,Vk}- 



The proof of Theorem 4.1 is given in Appendix A.l[ and follows readily from simple orthogo- 



nality considerations. Note that every Vi in the orthogonal tensor decomposition is robust, whereas 
for a symmetric matrix M, for almost all initial points, the map 9 i— > pffij converges only to an 
eigenvector corresponding to the largest magnitude eigenvalue. Also, since the tensor order is odd, 
the signs of the robust eigenvectors are fixed, as each — v% is mapped to Vi under Q. 

Variational characterization 

We now discuss a variational characterization of the orthogonal decomposition. The generalized 
Rayleigh quotient |ZG01| for a third-order tensor is 

T(u, u, u) 
U ^ (n T n)3/2 ' 

which can be compared to ([3]). For an orthogonally decomposable tensor, the following theorem 
shows that a non-zero vector u G R" is an isolated local maximizer [NW99] of the generalized 
Rayleigh quotient if and only if u = Vi for some i G [k] . 

Theorem 4.2. Assume n > 2. Let T have an orthogonal decomposition as given in Q, and 
consider the optimization problem 

m&xT{u,u,u) s.t. \\u\\ = 1. 



1. The stationary points are eigenvectors ofT. 

2. A stationary point u is an isolated local maximizer if and only if u = Vi for some i G [k] . 



The proof of Theorem 4.2 is given in Appendix A.2| It is similar to local optimality analysis 



for ICA methods using fourth-order cumulants {e.g., [DL95, FJK96 ). 

Again, we see similar distinctions to the matrix case. In the matrix case, the only local maximiz- 
ers of the Rayleigh quotient are the eigenvectors with the largest eigenvalue (and these maximizers 
take on the globally optimal value). For the case of orthogonal tensor forms, the robust eigenvectors 
are precisely the isolated local maximizers. 

An important implication of the two characterizations is that, for orthogonally decomposable 
tensors T, (i) the local maximizers of the objective function u \- V T(u, u, u) / (u T u) 3 ^ 2 correspond 
precisely to the vectors Vi in the decomposition, and (ii) these local maximizers can be reliably 
identified using a simple fixed-point iteration {i.e., the tensor analogue of the matrix power method). 
Moreover, a second-derivative test based on T{L, I, u) can be employed to test for local optimality 
and rule out other stationary points. 
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4.3 Estimation via orthogonal tensor decompositions 

We now demonstrate how the moment tensors obtained for various latent variable models in Sec- 
tion [3] can be reduced to an orthogonal form. For concreteness, we take the specific form from the 



exchangeable single topic model (Theorem 3.1): 



k 

M 2 = JjWj Hi ® 
i=i 

k 

M 3 = m <g> m <8> Hi- 

i=l 

(The more general case allows the weights W{ in M 2 to differ in M3 , but for simplicity we keep them 
the same in the following discussion.) We now show how to reduce these forms to an orthogonally 
decomposable tensor from which the Wi and Hi can be recovered. See Appendix[D]for a discussion as 
to how previous approaches |MR061 IAHK121 IAFH+ 12l IHK12] achieved this decomposit ion through 



a certain simultaneous diagonalization method. 

Throughout, we assume the following non-degeneracy condition. 

Condition 4.1 (Non-degeneracy). The vectors Hl>H2i ■■ ■ >Hk £ ^ are linearly independent, and 
the scalars wi, w 2 , ■ ■ ■ , Wf. > are strictly positive. 



Observe that Condition |4.1| implies that M 2 ^ is positive semidefinite and has xank-k. This 
is a mild condition; furthermore, when this condition is not met, learning is conjectured to be hard 
for both computational [MR06] and information-theoretic reasons |MV10j . 

The reduction 

First, let W G M dxfc be a linear transformation such that 

M 2 (W,W) = W T M 2 W = I 

where I is the k x k identity matrix (i.e., W whitens M 2 ). Since M 2 y 0, we may for concreteness 
take W := UD~ 1 / 2 , where U G M. dxk is the matrix of orthonormal eigenvectors of M 2 , and D G M. hxk 
is the diagonal matrix of positive eigenvalues of M 2 . Let 

Hi := ^pWi W T Hi- 
Observe that 

k k 

M 2 (W,W) = ^2 wT (V^^)(Vm^iVW = Ytfcfi = L 
i=l i=l 

so the Hi G M fe are orthonormal vectors. 

Now define M 3 := M 3 (W, W, W) G R kxkxk , so that 



k k 



i=l i=l 



1 ~5 
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As the following theorem shows, the orthogonal decomposition of M3 can be obtained by identifying 
its robust eigenvectors, upon which the original parameters Wi and m can be recovered. For 
simplicity, we only state the result in terms of robust eigenvector /eigenvalue pairs; one may also 
easily state everything in variational form using Theorem |4.2| 



Theorem 4.3. Assume Condition 4-1 and take M3 as defined above 



1. The set of robust eigenvectors of M3 is equal to {/ii,/22, • • • ,fik}- 

2. The eigenvalue corresponding to the robust eigenvector fli of M3 is equal to l/y/wl, for all 
ie[k]. 

3. If B E M. dxk is the Moore-Penrose pseudoinverse of W T , and (v,X) is a robust eigenvec- 
tor/eigenvalue pair of M3, then XBv = m for some i E [k\. 

The theorem follows by combining the above discussion with the robust eigenvector characteri- 



zation of Theorem 4.1 Recall that we have taken as convention that eigenvectors have unit norm, so 
the ^ are exactly determined from the robust eigenvector /eigenvalue pairs of M3 (together with the 
pseudoinverse of W T ); in particular, the scale of each pn is correctly identified (along with the cor- 
responding Wi). Relative to previous works on moment-based estimators for latent variable models 
(e.g., |AHK12[ |AFH + I2l IHK12] ). Theorem 4.3 emphasizes the role of the special tensor structure, 



which in turn makes transparent the applicability of methods for orthogonal tensor decomposition. 
Local maximizers of (cross moment) skewness 

The variational characterization provides an interesting perspective on the robust eigenvectors for 



these latent variable models. Consider the exchangeable single topic models (Theorem 3.1), and 
the objective function 

K[(xju)(x2u)(xju)] Ms(u,u,u) 



u 1 y 



E[(xju)(x^u)} 3 / 2 M 2 (u,u) 3 / 2 ' 

In this case, every local maximizer u* satisfies M 2 (I,u*) = ^Jw~i\Xi for some t'G [k]. The objective 
function can be interpreted as the (cross moment) skewness of the random vectors xi,x 2 , X3 along 
direction u. 



5 Tensor power method 

In this section, we consider the tensor power method of [LMVOO, Remark 3] for orthogonal tensor 
decomposition. We first state a simple convergence analysis for an orthogonally decomposable 
tensor T. 

When only an approximation T to an orthogonally decomposable tensor T is available (e.g., 
when empirical moments are used to estimate population moments), an orthogonal decomposition 
need not exist for this perturbed tensor (unlike for the case of matrices), and a more robust 
approach is required to extract the approximate decomposition. Here, we propose such a variant 
in Algorithm [T] and provide a detailed perturbation analysis. We note that alternative approaches 
such as simultaneous diagonalization can also be employed (see Appendix Id]) . 
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5.1 Convergence analysis for orthogonally decomposable tensors 



The following lemma establishes the quadratic convergence of the tensor power method (i.e., re- 
peated iteration of Q) for extracting a single component of the orthogonal decomposition. Note 
that the initial vector 9q determines which robust eigenvector will be the convergent point. Compu- 
tation of subsequent eigenvectors can be computed with deflation, i.e., by subtracting appropriate 
terms from T. 

Lemma 5.1. Let T G (££) 3 ]R n have an orthogonal decomposition as given in @. For a vector 9q G 
W 1 , suppose that the set of numbers \XivJ9o\, l-^vj^ol; • • • , lA^i^ol has a unique largest element. 
Without loss of generality, say lAit^ol is this largest value and \\2V2 &o\ is the second largest value. 
Fort = 1,2,..., let 

= T(I, flt-i, flt-i) 



Then 



i=2 



k \ 2 t + 1 



That is, repeated iteration of ^ starting from 9q converges to v\ at a quadratic rate. 

To obtain all eigenvectors, we may simply proceed iteratively using deflation, executing the 
power method on T—^j ^j v f 3 after having obtained robust eigenvector / eigenvalue pairs {(vj, Xj)}- 

Proof. Let 9q, 9\, 62, ■ ■ ■ be the sequence given by 9q := 9q and 9t ■= T(I, 9t-i, Ot-l) for t > 1. Let 
Cj := vJOq for all i G [k\. It is easy to check that (i) 9t = 9 t /\\9 t \\, and (ii) 9 t = Yli=i ~ lc T v i- 
(Indeed, 9 t+1 = £* =1 = Eti A* (A?* ^cf f v, = £* =1 Af^cfV) Then 



A2C2 



1 - l«l PtJ - 1 - „77 ,|2 - 1 ~ ^fc 2 t+l_ 2 2t+1 < ^fc ,2*+i-2 2*+i - Al 2^ A * 



Aici 



Since Ai > 0, we have vj9t > and hence \\v\ — 9t\\ = 2(1 — vJ9t) < 2(1 — (vJ9t) ) as required. □ 



5.2 Perturbation analysis of a robust tensor power method 

Now we consider the case where we have an approximation T to an orthogonally decomposable 
tensor T. Here, a more robust approach is required to extract an approximate decomposition. We 
propose such an algorithm in Algorithm [T] and provide a detailed perturbation analysis. 

Assume that the symmetric tensor T G M fcxfcxfc is orthogonally decomposable, and that T = 
T + E, where the perturbation E G M fcxfcxfc is a symmetric tensor with small operator norm: 

:= sup \E(9,9,9)\. 

11*11=1 

In our latent variable model applications, T is the tensor formed by using empirical moments, 
while T is the orthogonally decomposable tensor derived from the population moments for the 
given model. 

The following theorem is similar to Wedin's perturbation theorem for singular vectors of matri- 
ces |Wed72j in that it bounds the error of the (approximate) decomposition returned by Algorithm[T] 
on input T in terms of the size of the perturbation, provided that the perturbation is small enough. 
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Algorithm 1 Robust tensor power method 



input symmetric tensor T £ ^kxkxk^ nuni b er Q f iterations L, N. 
output the estimated eigenvector /eigenvalue pair; the deflated tensor. 
1: for r = 1 to L do 

(t) I 

2: Draw 8q uniformly at random from the unit sphere in M . 

3: for t = 1 to JV do 

4: Compute power iteration update 

1 ' \\f(i,ei T _\,et\)\\ 

5: end for 
6: end for 

7: Let t* := argmax re[i] {T 0%>)}. 

starting from 0]y to obtain 9, and set A := T(9, 9, 8). 
9: return the estimated eigenvector /eigenvalue pair (9, A); the deflated tensor T — A #® 3 . 



Theorem 5.1. Let f = T + E G M fcxfcxfc , ^ere T is a symmetric tensor with orthogonal decom- 
position T = Yli=i ^i v f 3 where each A, > 0, {ui,«2j • • • j^fc} * s an orthonormal basis, and E has 
operator norm e := Define A m i n := min{Aj : i E [fc]}, and A max := max{Aj : i £ [k]}. There 

exists universal constants Ci, 62,63 > such that the following holds. Pick any 7] G (0,1), and 
suppose 

e < d • A |^, iV > C 2 • flog(fe) + loglog( A,! 



and 



'ln(L/log 2 (fcA7)) 



ln(k) 



ln(ln(L/log 2 (fc/T ? ))) + C 3 
41n(L/log 2 (fc/7?)) 



ln(8) 



ln(L/log 2 (fc/7/)) 



> 1.02 1 + 



'ln(4)\ 
ln(fc) )" 



(Note that the condition on L holds with L = poly(fc) log(l/r/).j Suppose that Algorithm^ is 
iteratively called k times, where the input tensor is T in the first call, and in each subsequent call, 
the input tensor is the deflated tensor returned by the previous call. Let (v±, Ai), (#2, A2), . . . , A&) 
be the sequence of estimated eigenvector/eigenvalue pairs returned in these k calls. With probability 
at least 1 — rj, there exists a permutation ir on [k] such that 



'-Mi) 



< 8e/A 



|A 



Xj\<5e, Vj£[k], 



and 



3 V J 



< 55e. 



The proof of Theorem 5.1 is given in Appendix |Bj 

One important difference from Wedin's theorem is that this is an algorithm dependent pertur- 
bation analysis, specific to Algorithm [T] (since the perturbed tensor need not have an orthogonal 
decomposition). Furthermore, note that Algorithm [T] uses multiple restarts to ensure (approximate) 
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convergence — the intuition is that by restarting at multiple points, we eventually start at a point 
in which the initial contraction towards some eigenvector dominates the error E in our tensor. The 
proof shows that we find such a point with high probability within L = poly(fc) trials. It should be 
noted that for large k, the required bound on L is very close to linear in k. 

We note that it is also possible to design a variant of Algorithm [T] that instead uses a stop- 
ping criterion to determine if an iterate has (almost) converged to an eigenvector. For instance, 
if T(0,9,9) > max{||T||p'/v / 2r, ||T(J, J, 0)||^/1.O5}, where \\T\\f is the tensor Frobenius norm 
(vectorized Euclidean norm), and r is the expected rank of the unperturbed tensor (r = k — 
# of deflation steps), then it can be shown that 8 must be close to one of the eigenvectors, pro- 
vided that the perturbation is small enough. Using such a stopping criterion can reduce the number 
of random restarts when a good initial point is found early on. See Appendix [C] for details. 

In general, it is possible, when run on a general symmetric tensor (e.g., T), for the tensor power 



method to exhibit oscillatory behavior [KR02, Example 1]. This is not in conflict with Theorem 5.1 



which effectively bounds the amplitude of these oscillations; in particular, if T = T + E is a tensor 
built from empirical moments, the error term E (and thus the amplitude of the oscillations) can 
be driven down by drawing more samples. The practical value of addressing these oscillations and 
perhaps stabilizing the algorithm is an interesting direction for future research [KMllj . 

A final consideration is that for specific applications, it may be possible to use domain knowl- 
edge to choose better initialization points. For instance, in the topic modeling applications (c/. Sec- 
tion 3.1), the eigenvectors are related to the topic word distributions, and many documents may 
be primarily composed of words from just single topic. Therefore, good initialization points can 
be derived from these single-topic documents themselves, as these points would already be close to 
one of the eigenvectors. 



6 Discussion 

6.1 Practical implementation considerations 

A number of practical concerns arise when dealing with moment matrices and tensors. Below, we 



address two issues that are especially pertinent to topic modeling applications |AHK12|, lAFH + 12j 
(or other settings where the observations are sparse). 

Efficient moment representation for exchangeable models 

In an exchangeable bag-of- words model, it is assumed that the words x%, X2, ■ ■ ■ , X£ in a document 
are conditionally i.i.d. given the topic h. This allows one to estimate p-th. order moments using 



just p words per document. The estimators obtained via Theorem 3.1 (single topic model) and 



Theorem 3.5 (LDA) use only up to third-order moments, which suggests that each document only 
needs to have three words. 

In practice, one should use all of the words in a document for efficient estimation of the moments. 
One way to do this is to average over all (3) -3! ordered triples of words in a document of length 
t. At first blush, this seems computationally expensive (when t is large), but as it turns out, the 
averaging can be done implicitly. Let c £ 1^ be the word count vector for a document of length 
£, so Cj is the number of occurrences of word i in the document, and ^i=t c * = ^- Note that c is 
a sufficient statistic for the document. Then, the contribution of this document to the empirical 
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third-order moment tensor is given by 



11/ A , 

c <g> c ® c + 2 >^ q (ej ® ej ® ejj 

i=i 

d d d d d d 



(S 



3! 



a a a a a a \ 

QCj (e, <g> ej ® ej) - ^ ^ QCj (ej (8) ej <g> ej) - ^ ^ CjCj (e, ® ej (8) ej) ) . (8) 

1 o — 1 , — 1 o — 1 o — 1 o' — 1 / 



i=l j=l i=l j=l i=l j=l 

It can be checked that this quantity is equal to 

1 1 

(IT* 



(g) 



ordered word triple (a;, y, z) 

where the sum is over all ordered word triples in the document. A similar expression is easily 
derived for the contribution of the document to the empirical second-order moment matrix: 



c ® c — diag(c) ) . (9) 



1 1 

(J "21 

Note that the word count vector c is generally a sparse vector, so this representation allows for 
efficient multiplication by the moment matrices and tensors in time linear in the size of the document 
corpus (i.e., the number of non-zero entries in the term-document matrix). 

Dimensionality reduction 

Another serious concern regarding the use of tensor forms of moments is the need to operate 
on multidimensional arrays with Q(d 3 ) values (it is typically not exactly d 3 due to symmetry). 
When d is large (e.g., when it is the size of the vocabulary in natural language applications), even 
storing a third-order tensor in memory can be prohibitive. Sparsity is one factor that alleviates 
this problem. Another approach is to use efficient linear dimensionality reduction. When this 
is combined with efficient techniques for matrix and tensor multiplication that avoid explicitly 
constructing the moment matrices and tensors (such as the procedure described above), it is possible 
to avoid any computational scaling more than linear in the dimension d and the training sample 
size. 

Consider for concreteness the tensor decomposition approach for the exchangeable single topic 



model as discussed in Section 4.3 Using recent techniques for randomized linear algebra compu- 
tations (e.g, [HMT11J), it is possible to efficiently approximate the whitening matrix W G M. dxk 
from second-moment matrix M2 E ~R. dxd . To do this, one first multiplies M2 by a random matrix 
R E ~R dxk for some k' > k, and then computes the top k singular vectors of the product M2R. 
This provides a basis U G M. dxk whose span is approximately the range of M2. From here, an 
approximate SVD of U J M2U is used to compute the approximate whitening matrix W. Note that 
both matrix products M2R and U J M2U may be performed via implicit access to M2 by exploiting 
Q, so that M2 need not be explicitly formed. With the whitening matrix W in hand, the third- 
moment tensor M3 = Ms(W, W, W) G R fcxfcxfe can be implicitly computing via Q. For instance, 
the core computation in the tensor power method 0' := M^(I,9,9) is performed by (i) computing 
77 := W9, (ii) computing 7/ := M%(I,r],rj), and finally (iii) computing 9' := W T ij'. Using the fact 
that M3 is an empirical third-order moment tensor, these steps can be computed with 0(dk + N) 
operations, where N is the number of non-zero entries in the term-document matrix. 
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6.2 Computational complexity 



It is interesting to consider the computational complexity of the tensor power method in the dense 
setting where T £ jjfcxfcxfc j s orthogonally decomposable but otherwise unstructured. Each iter- 
ation requires 0(k 3 ) operations, and assuming at most k 1+s random restarts for extracting each 
eigenvector (for some small 5 > 0) and 0(log(A;) + loglog(l/e)) iterations per restart, the total 
running time is 0(k 5+s (log(fc) + loglog(l/e))) to extract all k eigenvectors and eigenvalues. 

An alternative approach to extracting the orthogonal decomposition of T is to reorganize T into 
a matrix M £ M fcxfc by flattening two of the dimensions into one. In this case, if T = Y2i=i ^^f" 3 ) 
then M = £* =1 A iVi (g) vec(f i (g> v{). This reveals the singular value decomposition of M (assuming 
the eigenvalues Ai, A2, • . . , A& are distinct), and therefore can be computed with 0(k 4 ) operations. 
Therefore it seems that the tensor power method is less efficient than a pure matrix-based approach 
via singular value decomposition. However, it should be noted that this matrix-based approach 
fails to recover the decomposition when eigenvalues are repeated, and it is unstable when the gap 
between eigenvalues is small. 

It is worth noting that the running times differ by roughly a factor of Q(k 1+S ), which can 
be accounted for by the random restarts. This gap can potentially be alleviated or removed by 
using a more clever method for initialization. Moreover, using special structure in the problem (as 
discussed above) can also improve the running time of the tensor power method. 



6.3 Sample complexity bounds 

Previous work on using linear algebraic methods for estimating latent variable models crucially rely 
on matrix perturbation analysis for deriving sample complexity bounds [MR06, HKZ12, AHK12, 
AF H + 12| IHK12] . The learning algorithms in these works are plug- in estimators that use empirical 
moments in place of the population moments, and then follow algebraic manipulations that result in 
the desired parameter estimates. As long as these manipulations can tolerate small perturbations of 
the population moments, a sample complexity bound can be obtained by exploiting the convergence 
of the empirical moments to the population moments via the law of large numbers. As discussed 
in Appendix |Dj these approaches do not directly lead to practical algorithms due to a certain 
amplification of the error (a polynomial factor of k, which is observed in practice). 

Using the perturbation analysis for the tensor power method, improved sample complexity 
bounds can be obtained for all of the examples discussed in Section [3} The underlying analysis 
remains the same as in previous works (e.g., |AFH+12l IHK12] ). the main difference being the 
accuracy of the orthogonal tensor decomposition obtained via the tensor power method. Relative 
to the previously cited works, the sample complexity bound will be considerably improved in its 



dependence on the rank parameter k, as T heorem 5.1 implies that the tensor estimation error (e.g. 



error in estimating M3 from Section 4.3) is not amplified by any factor explicitly depending on k 
(there is a requirement that the error be smaller than some factor depending on k, but this only 
contributes to a lower-order term in the sample complexity bound). See Appendix [P] for further 
discussion regarding the stability of the techniques from these previous works. 



6.4 Other perspectives 

The tensor power method is simply one approach for extracting the orthogonal decomposition 
needed in parameter estimation. The characterizations from Section 4.2 suggest that a number 
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of fixed point and variational techniques may be possible (and Appendix [D] provides yet another 
perspective based on simultaneous diagonalization) . One important consideration is that the model 
is often misspecified, and therefore approaches with more robust guarantees (e.g., for convergence) 
are desirable. Our own experience with the tensor power method (as applied to exchangeable topic 
modeling) is that while model misspecification does indeed affect convergence, the results can be 



very reasonable even after just a dozen or so iterations AFH + 12j . Nevertheless, robustness is likely 
more important in other applications, and thus the stabilization approaches, such as those proposed 
in |KK02l IRK031 IErd09l IKMTT] . may be advantageous. 
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A Fixed-point and variational characterizations of orthogonal ten- 
sor decompositions 

A.l Proof of Theorem 14.11 

Theorem 4.1 restated. Let T have an orthogonal decomposition as given in Q. 



1. The set of 9 G M n which do not converge to some Vi under repeated iteration of ^ has 



measure zero. 



2. The set of robust eigenvectors of T is {v\,V2, ■ ■ ■ , v^}. 

Proof. For a random choice of 9 G M n (under any distribution absolutely continuous with respect 
to Lebesgue measure), the values |AiuJ"0|, |A2t>J#|, . . . , \XkV^9\ will be distinct with probability 1. 
Therefore, there exists a unique largest value, say \XivJ9\ for some i £ [k], and by Lemma 5.1, we 
have convergence to V{ under repeated iteration of ([6]). Thus the first claim holds. 

We now prove the second claim. First, we show that every V{ is a robust eigenvector. Pick any 
i£ [k], and note that for a sufficiently small ball around Vi, we have that for all 9 in this ball, \ivJ9 



is strictly greater than XjvJ9 for j G [k] \ {i}. Thus by Lemma 5.1 , Vi is a robust eigenvector. Now 
we show that the Vi are the only robust eigenvectors. Suppose there exists some robust eigenvector 
u not equal to Vi for any i G [k\. Then there exists a positive measure set around u such that all 
points in this set converge to u under repeated iteration of This contradicts the first claim. □ 

A.2 Proof of Theorem I3~2l 



Theorem 4.2 restated. Assume n > 2. Let T have an orthogonal decomposition as given in Q ; 



and consider the optimization problem 

max T(-u,u, u) s.t. \\u\\ = 1. 

1. The stationary points are eigenvectors ofT. 

2. A stationary point u is an isolated local maximizer if and only if u = v% for some i G [k]. 

Proof. Consider the Lagrangian form of the corresponding constrained maximization problem over 
unit vectors u G M n : 

3 

C(u, A) := T(u, u, u) — -X(u T u — 1). 

Since 

V u C(u, A) = V u Hr Xi(vju) 3 - h(u T u - 1) \ = 3(r(J, u, u) - Xuj , 
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the stationary points u 6 W 1 (with ||u|| = 1) satisfy 

T(I,u,u) = Xu 

for some A 6 R, i.e., (u, X) is an eigenvector/eigenvalue pair of T. 

Now we characterize the isolated local maximizers. Extend {vi,V2, ■ ■ ■ ,Vk} to an orthonormal 
basis {v\,V2, ■ ■ ■ , v n } of W 1 . Now pick any stationary point u = J27=i c i v * w ^ tn ^ := T{u, u, u) = 
u T T(I,u,u). Then 

Ajcf = Xi(u T Vi) 2 = vjT(I ,u,u) = Xvju = Aq > 0, i € [fe], 

and thus 

V 2 £(u,A) = 6^AiCi UiuT" -3A/ = 3A( 2^^-/) 

i=l ^ ieCl ' 

where O := {i G [k] : q / 0}. This implies that for any unit vector w GW 1 , 

w T V 2 u C(u, X)w = 3A (2 J>» 2 - l) . 

The point u is an isolated local maximum if the above quantity is strictly negative for all unit 
vectors w orthogonal to u. We now consider three cases depending on the cardinality of Q and the 
sign of A. 

• Case 1: |f2| = 1 and A > 0. This means u = Vi for some i G [A;] (as u = — V{ implies 
A = — X{ < 0). In this case, 

w T V 2 u C(u,X)w = 3Xi(2(vJw) 2 - 1) = -3Aj < 

for all w G W 1 satisfying (u T w) 2 = (vjw) 2 = 0. Hence u is an isolated local maximizer. 

• Case 2: \Q\ > 2 and A > 0. Since > 2, we may pick a strict non-empty subset S C f2 and 
set 




where := ^2ies c h ^S c '■= l2ien\s c h an d Z '■= y/l/Zs + 1/Zs c - It is easy to check that 
HHP = SienC^i" 10 ) 2 = 1 an d uTti; = 0- Consider any open neighborhood U of u, and pick 
5 > small enough so that {t := \/l — S 2 u + 5w is contained in U. Set uq := VT— <5 2 w. By 
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Taylor's theorem, there exists e G [0, 5] such that, for u := u$ + ew, we have 

T(u, u, u) = T(u , n , no) + V u T(u, u, u) T (u - u ) + -(u - uo) T V 2 T(u, u, u)(u - uq) 

U=UQ Z 



(1 - 5 2 f /2 \ + 5y/l - 5 2 Xu T w + ^5 2 w T V 2 u T(u, u, u) 

k 

(1 - 5 2 f/ 2 \ + + 35 2 HvJ(uo + ew))(vjw) 2 



i=i 

k 



(1 - ,5 2 ) 3 / 2 A + 35 2 VT^P X MvJw) 2 + 3<5 2 e ^ Hvjwf 



i=l i=l 
k 

T„.,\3 



(1 _ ^3/2 A + 36 2^y 1 _ px Y(vjw) 2 + 35 2 e Y Ai(«» 

iesi i=i 

(1 _ (5 2 ) 3/2 A + 3 ^y^T^ A + 3(5 2 e £ A ,(u» 3 

2 = 1 



1 - + 0(5 4 ) N ) A + 3<5Vl-<$ 2 A + 35 2 e ^ Ai(t>» 3 

' i=l 



Since e < 5, for small enough 5, the RHS is strictly greater than A. This implies that u is not 
an isolated local maximizer. 

Case 3: = or A < 0. Note that if |0| = 0, then A = 0, so we just consider A < 0. 
Consider any open neighborhood U of u, and pick j 6 [n] and 5 > small enough so that 
u := Z _1 (u + Svj) is contained in U, where Z := a/T+ 2c j 5 + 5 2 . (Note that this is always 
possible because n > 2.) If Cj = 0, then clearly T(u,u,u) = (1 + 5 2 )~ 3 / 2 (A + 5 3 Xj) > A. 
Otherwise, 

T(u, u, u) = (1 + 2cj6 + 5 2 y 3/2 (t(u, u, u) + 3XjC 2 5 + 3AjC^ 2 + 5 3 Xj 
_ (1 + 3c j( 5 + 3<5 2 )A ^ <5 3 A j 



(1 + 2 Cj 5 + <5 2 ) 3 / 2 (1 + 2 Cj 5 + <5 2 ) 3 / 2 
= (1 - 0(c 2 5 2 ))A + (1 - 0(8))5 3 Xj > A 

for sufficiently small 5. Thus u is not an isolated local maximizer. 

From these exhaustive cases, we conclude that a stationary point u is an isolated local maximizer 
if and only if u = V\ for some i S [k] . □ 

Note that we require n > 2 because in the case n = 1, the unit sphere contains only two points 
(ei and — ei), and both are isolated. 



B Analysis of robust power method 



In this section, we prove Theorem 5.1 The proof is structured as follows. In Appendix |B.l 



we 



show that with high probability, at least one out of L random vectors will be a good initializer 
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for the tensor power iterations. An initializer is good if its projection onto an eigenvector is 



noticeably larger than its projection onto other eigenvectors. We then analyze in Appendix B.2 



the convergence behavior of the tensor power iterations. Relative to the proof of Lemma 5.1, this 
analysis is complicated by the tensor perturbation. We show that there is an initial slow convergence 
phase (linear rate rather than quadratic), but as soon as the projection of the iterate onto an 
eigenvector is large enough, it enters the quadratic convergence regime until the perturbation 
dominates. Finally, we show how errors accrue due to deflation in Appendix |B.3 which is rather 
subtle and different from deflation with matrix eigendecompositions. This is because when some 
initial set of eigenvectors and eigenvalues are accurately recovered, the additional errors due to 



deflation are effectively only lower-order terms. These three pieces are assembled in Appendix B.4 



to complete the proof of Theorem 5.1 



B.l Initialization 

Consider a set of non- negative numbers Ai, A2, • • • , A& > 0. For 7 £ (0, 1), we say a unit vector 
#0 £ M fc is ^-separated relative to i* £ [k] if 

Ai*|0j* n| - max \i\0ifl\ > 7A;|6>;* i0 | 

(the dependence on Ai, A2, • • • , A& is implicit). 

The following lemma shows that for any constant 7, with probability at least 1 — rj, at least one 
out of poly(/c) log(l/r/) i.i.d. random vectors (uniformly distributed over the unit sphere S k ~ 1 ) is 
7-separated relative to argmaxjg^ Aj. (For small enough 7 and large enough k, the polynomial is 
close to linear in A;.) 

Lemma B.l. There exists an absolute constant c > such that if positive integer L>2 satisfies 
[HL) ( ln(ln(L)) + c [HZ)] 1 [M*)\ nm 

the following holds. With probability at least 1/2 over the choice of L i.i.d. random vectors drawn 
uniformly distributed over the unit sphere S k ~ 1 in M. k , at least one of the vectors is ^-separated 
relative to argmax ig [ fc ] X, L . Moreover, with the same c, L, and for any r\ £ (0, 1), with probability 
at least 1 — n over L ■ log 2 (l/?7) i.i.d. uniform random unit vectors, at least one of the vectors is 
7 -separated. 

Proof. Without loss of generality, assume argmax ig [ fc j Aj = 1. Consider a random matrix Z £ 
jjfcxL w j lose en tries are independent A^(0, 1) random variables; we take the j-th column of Z to 
be comprised of the random variables used for the j-th random vector (before normalization). 
Specifically, for the j-th random vector, 

9 i;0 := f' j = , i £ [n\. 

yJ2i'=i z f',j 

It suffices to show that with probability at least 1/2, there is a column j* £ [L] such that 

\Zi 7*1 > max I^i7*|. 

1-7 ie[fc]\{i} ,J 



33 



Since max Jg m \Zx t j\ is a 1-Lipschitz function of L independent AA(0, 1) random variables, it 
follows that 

> V 2 ln (8) < 1/4. 



Pr 



max I Z\ n I — median 



Moreover, 



median 



max \Z-\ ,• I 
■ie[L]' Jl 



max \ Z\ j\ 



> median 



maxZ 



1J 



=: m. 



Observe that the cumulative distribution function of maxj g [^] Z\j is given by F(z) = $(z) , where 
$ is the standard Gaussian CDF. Since F(m) = 1/2, it follows that m = <J> _1 (2 _1 / L ). It can be 
checked that 

for some absolute constant c > 0. Also, let j* := argmax^g^ \Z\j\. 

Now for each j 6 [L], let |^2:fcj| := m ax{|Z2j|, |^3j|, • • • , \Zkj\}. Again, since |^2:fcj| is a 
1-Lipschitz function of k — 1 independent AA(0, 1) random variables, it follows that 



Pr 



|^2:fc,i| >E 



Z 2:fcjj |j + V21n(4) 



< 1/4. 



Moreover, by a standard argument, 



E 



\Z2:k, 



< y/2\n(k). 



Since |^2:fcj| is independent of for all j € [L], it follows that the previous two displayed 

inequalities also hold with j replaced by j* . 

Therefore we conclude with a union bound that with probability at least 1/2, 

\Z hr \ > y / 21n(L) - HHL)) + c _ and \ Z ^\ < ^2ln(k) + ^2ln(^. 

2-^/2 m(L) 



Since L satisfies (10) by assumption, in this event, the j*-th random vector is 7-separated. 



□ 



B.2 Tensor power iterations 

Recall the update rule used in the power method. Let 9t = Ym=i @i,tVi 6 M fc be the unit vector at 
time t. Then 



i=l 



In this subsection, we assume that T has the form 

k 



i=i 



where {i>i,i>2, • • • ,Vk} is an orthonormal basis, and, without loss of generality, 

Ai|0it| = max \i 1 9 it \ > 0. 

ie[fc] 



(11) 



34 



Also, define 

A min := min{Aj : i G [k], A* > 0}, A max := max{Aj : % G [k]}. 
We further assume the error E is a symmetric tensor such that, for some constant p > 1, 

\\E(I, u,u) || < e, VUGS'*- 1 ; 

||£(I, u, u)|| < e/p, Vu G S^ 1 s.t. (u T vi) 2 > 1 - (3e/Ai) 2 . 



(12) 
(13) 



In t he n ext two propositions (Propositions B.l and B.2) and next two lemmas (Lemmas B.2 
and B.3), we analyze the power method iterations using T at some arbitrary iterate 0t using 



only the property (12) of E. But throughout, the quantity e can be replaced by e/p if 9t satisfies 
(01 



1T v\) 2 > 1 — (3e/Ai) 2 as per property (13). 



Define 



0lr 



1 - *L 



1/2 



1 



mm^i |r i)T | 



5r 



Al^l,T 

Ai|^i, T | 

e 



(14) 



k :- 



for r G {t,t + 1}. 
Proposition B.l. 



mm \r{ t > — 



Proposition B.2. 



-Rt+i > -R* • 



7t > 1 



1 + K<5 t r 2 t 

I -s t 



K 

Rt' 



9 2 



Rl 



i + Rr 



i-s t 

i- + 



i G [fc], 



> 



I -St 



1 - 7t + <5 t #t - + <5 t 



(15) 
(16) 



Proof. Let m == f(I,6 t ,0 t ), so = t+ i/||^+i||. Since M+1 = f{v h 9 t ,9 t ) = T{ Vi ,9 u 9 t ) + 
E(vi,0t,O t ), we have 

6 i l+i = \i0lt + E{v i ,6 u t ), i€ [A:]. 
Using the triangle inequality and the fact ||-E(fj, 0t, #t)|| < e, we have 



and 



Qi,t+l — Xi&i,t — e > \Gi,t\ • ^Ajl^tl — e/|#i,t 
\0i,t+i\ < |Ai«itl + e < \0i,t\ • (%|0;, t | + e/\9 



i,t\ 



(17) 
(18) 



for all i G [fc]. Combining (17) and (18) gives 
Ai0i,t+i 



r i,t+l 



Xi\9 



Xi6i,t+i 2 



I -s t 



I -St 



i\ u i,t+l\ 



Ai| 9i, 



t+l\ 



1 + 



1 + {\i/M)8 t r. t 



> r 



I -St 



t.t 



1 + KS t r\ t ' 
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Moreover, by the triangle inequality and Holder's inequality, 

1/2 / n „x 1/2 



( \ l / z / 2\ 
i=2 ' M=2 ' 

n \ 1/2 / n \ 1/2 

H=2 ' ^i=2 ' 

, n v 1/2 
<maxAi|^ t |I^^J +e 

= (1 - ^ t ) 1/2 • (max A 4 |^, t | + 67(1 - ^L) 1 / 2 ) . (19) 



Combining (17) and (19) gives 



ri,t+i| _ ri,M-l| > ri,tl Ail^l — e/|0i,t| 



(i - ^,m) i/2 (Er =2 ^,*+i] 2 ) 1/2 " (1 " ^* )1/a max ^ 1 ~ mA + 2/(1 " ^ )1/2 ' 

In terms of i?t+i, "ft, and <5t, this reads 

I -St _ I -St i -s t I -S t 



where the last inequality follows from Proposition B.l □ 
Lemma B.2. Fix any p > 1. Assume 



< St < min 



1 

2(1 + 2np 2 ) ' 1 + «p J 



and 7 t > 2(1 + 2Kp 2 )5 t . 

1. Ifr 2 it < 2p 2 , then r i>t+1 > \r i>t \(l + f ). 

S. ///9 2 < r? tJ taen r M+1 > min^/p, ^^}- 
5. 7t+i > min{7 t , 1 - 1/p}. 

I Ifmm^rl > (p(l - 5 t ) - l)/(«<St), tten i? m > ■ ^ • 

5. IfRt<l + 2k P 2 , then R t+l >R t (l + f), 6 2 l t+l > 0? t) and 8 t+x < 5 t . 
Proof. Consider two (overlapping) cases depending on r 2 t . 



• Case 1: rf t < 2p . By (15) from Proposition 



B.2 



^2 1 ~St . | | 1 I ~S t . ./ 7* 

n,t+i > r it ■ — — t-j- > \n,t\ ■ - — — - — 2F > \n, t \ 1 1 + — 

' 1 + K<5 t rf t 1-7* 1 + 2Kp^d i V 2 

where the last inequality uses the assumption 7^ > 2(1 + 2K,p 2 )5f This proves the first claim. 
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Case 2: p 2 < rf t . We split into two sub-cases. Suppose rf t < (p(l — St) — l)/(«;<5f). Then, 

I -S t 



by (15), 



1 — «5* 



1 + ^"-^ 1 + ^^11=1 



Now suppose instead r? t > (p(l — St) — l)/(nSt). Then 



n,t+x 



> 



l -5 t 



I -S t - lj_p 
nS t 



(20) 



Observe that if mim^i r 2 t < (p(l — St) — l)/(nSt), then r^t+i > \r^t\ for all i 6 [fc], and hence 



7i+l > 7t- Otherwise we have jt+i > 1 



k6± 
l-St-l/p 



> 1 — 1/p. This proves the third claim. 



If min^i rf t > (p(l — St) — l)/(nSt), then we may apply the inequality (20) from the second 
sub-case of Case 2 above to get 



R. 



t+i 



E^i(Ai/A,) 2 /^ +1 



1/2 



> 



1-S t -1/ P \ A 



nS t 



A I 



1 



This proves the fourth claim. 

Finally, for the last claim, if Rt < l+2np 2 , then by ( 16 ) from Proposition B.2 and the assumption 
7t > 2(1 + 2 K p 2 )S t , 



Rt+i > Rt 



I -St 



1 



jt 



> Rt 



2(l+2/tp 2 ) 

1 " lt/2 



>Rt[l+if 



1 + 2k P 2 J ~ *V 3 



1 - j t + o t i^ 

This in turn implies that 2 t+l > 9 2 t via Proposition B.l, and thus St+i < St- 

Lemma B.3. Assume < St < 1/2 and 7t > 0. PicA; any (3 > a > smc/i t/iaf 

a e a e 

(l + a)(l + a 2 ) " 2(l + a)(l + /3 2 ) ~ AY 

//-Rt > 1/a, then R t+ i > 1 /a. 
2. Ifl/a > Rt> 1/0, then R t+1 > mm{R 2 /(2K), 1/a}. 
Proof. Observe that for any c > 0, 



□ 



Rt > 1 o 
c 



'l.f 



> 



1 



1 + c 2 

Now consider the following cases depending on i2<. 
• Case 1: Rt > 1/a. In this case, we have 



^ St< 



(1 + c 2 )e 
Ai 



(21) 



6* < 



(l + a 2 )e 
Ai 



< 



Q?7t 
1 + a 



by ( |21[ ) (with c = a) and the condition on a. Combining this with (16) from Proposition B.2 
gives 



R 



I -S t 

> — > 



1 _ OTjt 
1 1 + Q 



1 

a 
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Case 2: 1//3 < Rt < 1/a. In this case, we have 



St < 



(l + /3 2 )6 
Ai 



< 



a 



2(1 + a) 



by (21) (with c = (3) and the conditions on a and j3. If 5 t > 1/(2 + R 2 /k), then (pi) implies 

1 



R 



t+i 



1-S t l-28 t 

> > > 



l+a 



25 t 



l+a 



1 



If instead S t < 1/(2 + Rf/K), then (16) implies 



R. 



t+i 



> 



1 



l 



> 



2+R?Jk 



2k 



□ 



Approximate recovery of a single eigenvector 

We now state the main result regarding the approximate recovery of a single eigenvector using 
the tensor power method on T. Here, we exploit the special properties of the error E (both (12) 
and ([13])). 

Lemma B.4. There exists a universal constant C > such that the following holds. Let i* := 
argmax ie[fc ] Ai|0 i)O |. // 



e < 



To 



2(1 + 8k) 



A 



min ' u i*fl 



and N>C- 



+ log log — — 



7o 



then after t > N iterations of the tensor power method on tensor T as defined in (11) and satisfy- 
ing (12) and (13), the final vector 0t satisfies 

4e 



> \ 1 



3e 
p~\i* 



It ~ Vi*\\ < 



\f(9 u e t ,e t 



Xi*\ < 27k 



p\i* 



+ 2)-. 

V 



Proof. Assume without loss of generality that i* = 1. We consider three phases: (i) iterations before 
the first time t such that Rt > l + 2Kp 2 = 1 + 8k (using p := 2), (ii) the subsequent iterations before 
the first time t such that Rt > 1/a (where a will be defined below), and finally (iii) the remaining 
iterations. 

We begin by analyzing the first phase, i.e., the iterates in T% := {t > : Rt < 1 + 2k/> 2 = 1+8k}. 
Observe that the condition on e implies 



So 



< 



7o 



A, 



A^f o 2(1 + 8k) Ai 



< min 



7o 



2(1 + 2^2)' 2(1 + 2 K p 2 ) 



and hence the preconditions on 5t and 7^ of Lemma B.2 hold for t = 0. For all t G T\ satisfying the 
preconditions, Lemma B.2 implies that 5t+i < St and ^t+i > mm {7i> 1 — I// 5 }) so the next iteration 
also satisfies the preconditions. Hence by induction, the preconditions hold for all iterations in T\. 
Moreover, for all i G [k], we have 

,1 

Ko > 1 ; 

1-70 
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quadratic rate while p 2 < rf t < x (The specific rates are given, respectively, in Lemma 



claims 1 and 2.) Since — ^j- 

2 



l-St-l/p 

p < it follows that min^ir? t < 1 ~":„7'- / ' lor at most 

A, 



and while t £ Ti: (i) |rit| increases at a linear rate while r| t < 2p 2 , and (ii) |rj^| increases at a 

en, respi 

X-St-X/p 



B.2 



In 



To 



\ In a/2 



1-70 



o 



k5, 

I . . Ai 



+ log log — 

7o e 



(22) 



iterations in T\. As soon as min^i rf t > 1 °^ t ±/P , we have that in the next iteration, 



Rt+i > 



l-St-l/p 

6 t -l/p A r 



nS t 



Ai 



s/jfe v/fc 



and all the while i?t is growing at a linear rate (given in Lemma B.2, claim 5). Therefore, there are 
at most an additional 



70 V 7/Vk J 



log(fc/c) 



7o 



(23) 



iterations in T\ over that counted in ( |22[ ). Therefore, by combining the counts in (22) and (23), we 
have that the number of iterations in the first phase satisfies 



Ti| = 0( log log ^ 



log(/cK) " 

7o 

We now analyze the second phase, i.e., the iterates in := {t > : t ^ T\, Rt < 1/a}. Define 

3e „ 1 1 

a? 



a :- 



(3:-- 



\ + 2k P 2 1 + 8k' 



Note that for the initial iteration t' := minT2, we have that Rf > 1 + 2np 2 = 1 + 8k = 
and by Proposition B.l, 7f > 1 — + 8k) > 7/8. It can be checked that St, "ft, ot, and j3 
satisfy the preconditions of Lemma B.3 for this initial iteration t' . For all t E T<i satisfying these 
preconditions, Lemma B.3 implies that Rt+i > rnin{i?t, 1/a}, ^ft+i — mm {^i 1/(1 + a 2 )} (via 
Proposition B.l ), 5t+i < max{<5f, (1 + a) 2 e/Ai} (using the definition of St), and jt+i > min{7t, 1 — 
ok} (via Proposition B.l). Hence the next iteration t + 1 also satisfies the preconditions, and by 
induction, so do all iterations in T^. To bound the number of iterations in T2, observe that Rt 
increases at a quadratic rate until Rt > 1/a, so 



IT2I < In 



( ln(l/a) 
\H(1/P)/(2k)) 



< In 



ln^i 
In 4 



O^loglogy^. 



(24) 



Therefore the total number of iterations before Rt > 1/a is 

/log(fc«) Al 

O h log log — 

V 7o e 

After > 1/a (for t" := max(Ti U T2) + 1), we have 

1/a 2 



'l,tf 



> 



1 + 1/a 2 



> 1 - a z > 1 
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Therefo re, t he vector Q t " satisfies the condition for property (13) of E to hold. Now we apply 
Lemma 



B.3 



using e/p in place of e, including in the definition of 5t (which we call 5t). 

6t " P^lt' 



we also replace a and j3 with a and j3, which we set to 

a-=— B-= — 
p\i ' Ai 

It can be checked that Sf € (0, 1/2), j t " > 1 — 3e«;/Ai > 0, 



o 



> 



> 



o 



> 



(l + a)(l + a 2 ) p (l -3e/e/Ai)Ai pj t "^i 2(l + a)(l + /3 ) PM ' 



Therefore, the preconditions of Lemma |B.3 are satisfied for the initial iteration t" in this final 
phase, and by the same arguments as before, the preconditio ns h old for all subsequent iterations 
t > t" . Initially, we have Rf > 1/a > 1/(3, and by Lemma 



B.3 



we have that Rt increases at a 
quadratic rate in this final phase until Rt > 1 /a. So the number of iterations before Rt > 1/a can 
be bounded as 



In 



ln(l/a) 



dn((l//3)/(2 K )) 
Once i? t > 1/a, we have 



In 



In 2^ 



l n ( Ai . J_ 



< In In 



pAi 
3e 



O loglo! 



pA] 



3e 
pAi 



Since signal,*) = r 1)t > r? >t _ x • (1 - ?t_i)/(l + «;<5 t _irf^_ 1 ) = (1 - + «rf t -l) > by 

Proposition |B,2, we have #1^ > 0. Therefore we can conclude that 



2(1 - M < \ 2 1 - a/1 - (3e/(pAx)) 2 < 4e/(pAi). 



40 



Finally, 

\f(o t ,6 t ,e t )-\i\ 



\i(elt - 1) + J2~Wlt + m,e t ,e t ) 

i=2 
k 

< Ai|(9?, t - i| +J2 MOiABlt + \\E(IAA)\\ 

i=2 

k 

< Ai (1 - 6 U + |M1 " Olt)\) + max \\O itt \ £ 0? t + \\E(I, O t , 9 t ) 



i=2 



< Ai (1 - 9 U + |M1 " »m)I) + max A^l - Q\ t ^ 9% + ||£(7, t , t ) 

^ i=2 

= ~X 1 (1 - 1>t + \9 1>t (l - e 2 ht )\) +maxAi(l - 2 ,) 3 / 2 + ||£(/, t , t )|| 



< Ai • 3 



+ kAi 



< 



3e 

pAi, 

(27K-(i/p\ 1 ) 2 + 2)e 
P 



3e 

+ 

pA x y P 



□ 



B.3 Deflation 

Lemma B.5. Fix some e > 0. Lei {vi, i^, ■ ■ ■ , v^} be an orthonormal basis for M fc , and Ai, A2, . . . , A& > 
with A m j n := minj g [ fe ] Aj. Also, let {v\, V2, ■ ■ • , Vk} be a set of unit vectors in M fe (not necessarily 
orthogonal), X\, A2, . . . , Xk > 6e non-negative scalars, and define 

£i := Xivf 3 - kvf 3 , i€[k]. 

Pick any t £ [k] . If 

\Xi - Xi\ < e, 

\\vi - Vi\\ < min{\/2, 2e/Aj} 
/or a// i £ [i], iaen /or any umt vector u 6 

' 2 / \ * 

^^(/,n,n) < ( 4(5 + lle/A min ) 2 + 128(1 + e/A min ) 2 (e/A™ n ) 2 J e 2 ^(n T n;) 2 

i=\ 2 ^ ' i=l 

i x t 

+ 64(1 + e/A min ) 2 e 2 ^(e/A;) 2 + 2048(1 + e/A min ) 2 e 2 ( J>/Ai) ; 



i=i 



i=i 



In particular, for any A £ (0, 1), there exists a constant A' > (depending only on A) such that 
e < A'X m i n /y/k implies 



t 2 / t 

Si(I, u,u) <(A + 100jj 
i=i 2 ^ i=i 



u T «i) 2 e 2 . 



41 



Proof. For any unit vector u and i 6 [i], the error term 

£i(I,u,u) = Xi(u T Vi) 2 Vi - Xi(u T Vi) 2 Vi 
lives in spanjvj, Oj}; this space is the same as spanj^j , vf- } , where 

:=Vi- (vjvi)vi 

is the projection of V{ onto the subspace orthogonal to v%. Since ||£>i — Vj|| 2 = 2(1 — vjvi), it follows 
that 

Cj := vjvi = 1 - - Vi\\ 2 /2 > 

(the inequality follows from the assumption ||t)j — V{\\ < \/2, which in turn implies < Cj < 1). By 
the Pythagorean theorem and the above inequality for Cj, 

II ~-L Il2 i 2^n» ||2 
pi || = 1 - Cj < \\Vi - Vi\\ . 

Later, we will also need the following bound, which is easily derived from the above inequalities 
and the triangle inequality: 

|1 - cf | = |1 - a + a(l - c|)| < 1 - c t + |ci(l - cf)| < 1.5||«< - Vif. 

We now express £i(I,u,u) in terms of the coordinate system defined by Vi and v^~, depicted 
below. 




Define 

di := u T Vi and bi := u T (y^/\\v--\\). 
(Note that the part of u living in spanj^j, v^~} ± is irrelevant for analyzing £i(I,u,u).) We have 

£i(I,u,u) = Xi(u T Vi) 2 Vi - Xi(u T Vi) 2 Vi 

= Xiafvi - Xi(a,iCi + \\v^\\bi) 2 (ciVi + v^) 

= Xiafvi - Xi(a 2 c 2 + 2\\vi\\aibiCi + \\vl\\ 2 b 2 )ciVi - A^OjCj + \\vf \\h) 2 vl 

= {{Xi - XiCi )a 2 - 2Xi\\vl\\aibiC 2 - Xi\\v^-\\ 2 b 2 c^j Vi - Aj || (ojCj + ll^ll&i) 2 ^/!!^!!) 
v v ' s w ' 

= A i v i -B i (vl/\\vl\\). 
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The overall error can also be expressed in terms of the A{ and Bf. 



u, u, 



1=1 
t 



< 2 



i=i 
t 



iV-j 



i=i 



+ 2 



i=l 



(25) 



<2j>? + 2 El^ 

i=i ^?=i 

where the first inequality uses the fact (x + y) 2 < 2{x 2 + y 2 ) and the triangle inequality, and the 
second inequality uses the orthonormality of the V{ and the triangle inequality. 

It remains to bound A 2 and \Bi\ in terms of |aj|, Aj, and e. The first term, A 2 , can be bounded 



using the triangle inequality and the various bounds on |Aj — Aj|, \\i>i 



\vi\\, and cf. 



\Ai\ < (\Xi - \i\4 + \i\4 - l\)a 2 + 2(A, + \\ - Xi 



\aibi\c 2 + (A, + |Aj - Aj 



\vH 2 b 2 a 



< (|Ai - Ai| + 1.5Ai||t>i - Vi\\ 2 + 2(A» + \\i - \\)\\vi - v*||)|o»| + (A* + |A* - Aj|)||^ - Wi 

< (e + 7i 2 /Xi + 4e + ie 2 /X l )\a i \ + 4e 2 /A l + e 3 /A 2 
= (5 + lU/Xi)e\ai\ + 4(1 + e/X l )e 2 /X i , 



and therefore (via (x + y) 2 < 2(x 2 + y 2 )) 

A 2 < 2(5 + lle/Xi) 2 e 2 a 2 + 32(1 + e/A;) 2 e 4 /A 2 . 
The second term, is bounded similarly: 

|B i |<2(A i + |A i -A i |)||«- L || 2 (a? + ll*i L H 2 ) 

< 2(Xi + \Xi - Xi\)\\vi - Vi\\ 2 (a 2 + \\vi - Vif) 

< 8(1 + e/Xi)(e 2 /Xi)a 2 + 32(1 + e/Xi)e 4 /X* . 
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Therefore, using the inequality from (25) and again (x + y) < 2(x + y ), 

t 

y^£i{i,u,u 



i=l 



<2^4 2 + 2fel^l) 

i=l V i=l ' 

t t 

< 4(5 + lle7A min ) 2 e 2 a\ + 64(1 + e7A min ) 2 e 2 ^(e/A;) 2 

i=l i=l 

/ * * \ 2 

+ 2f 8(1 + e7Amin)(e 2 /Amin) a2 i + 32 ( X + e7Amin)e^(e7A i ) 3 J 

^ i=l i=l ' 

t t 

< 4(5 + lle7A min ) 2 e 2 a i + 64 (1 + e/A min ) 2 e 2 J>/A;) 2 

1=1 2=1 

t , t 

+ 128(1 + e7Amin) 2 (7Amin) 2 e 2 a i + 2048 ( 1 + e 7A min ) 2 e 2 ( j^(e/Aj 

i=l M=i 

4(5 + lle/A^n) 2 + 128(1 + e7A min ) 2 (e7A min ) 2 ] e 2 ^ a 2 

' i=l 

i , t 

+ 64(1 + e7A min ) 2 e 2 £>/A;) 2 + 2048(1 + e/X min ) 2 i 2 ( £(e/Aj 

8=1 ^t=l 



□ 



B.4 Proof of the main theorem 
Theorem 



5.1 



restated. LetT = T + E £ ]R fc x fc x fc ; w/jere T is a symmetric tensor with orthogonal 
decomposition T = Yli=i ^i v f 3 where each Aj > ; {v±, V2, ■ ■ ■ , v^} is an orthonormal basis, and E 
has operator norm e := \\E\\. Define A m i n := min{Aj : i £ [k]}, and A max : = max{Aj : i £ [k]}. 
There exists universal constants C\,C2,C?, > such that the following holds. Pick any r\ £ (0,1), 
and suppose 



e < C 



A, 



i • 



N > C 2 • log(fc) + log log 



A, 



and 



/ ln(L/log 2 (fc/7/)) r ln(ln(L/log 2 (A;/77))) + C3 



41n(L/log 2 (fc/77)) 



' ln(8) 
ln(L/log 2 (A;/»7)) 



> 1.02 1 + 



'ln(4) 
ln(fc) 



(Note that the condition on L holds with L = poly(fc) \og{\/rj).) Suppose that Algorithm^ is 
iteratively called k times, where the input tensor is T in the first call, and in each subsequent call, 
the input tensor is the deflated tensor returned by the previous call. Let (vi, Ai), (pz, Aa), . . . , A&) 
be the sequence of estimated eigenvector/eigenvalue pairs returned in these k calls. With probability 
at least 1 — rj, there exists a permutation it on [k] such that 

IK(j) - Wjll < &e/K{j), \K{j) ~ Aj| < 5e, Vj £ [k], 
i=i 
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Proof. We prove by induction that for each i G [k] (corresponding to the i-th call to Algorithm [T]) , 
with probability at least 1 — irj/k, there exists a permutation tt on [k] such that the following 
assertions hold. 

1. For all j < i, \\v % y\ — %|| < ^e/X^^ and ~ ^j\ < 12e. 

2. The error tensor 

z w ■■= (r-E 3 ) - E WIS) = ^+E ( w?S> - 



satisfies 



\\E i+ i(I,u,u)\\ < 56e, VuGS^ 1 ; (26) 
||£ m (I,u,u)|| < 2e, Vu G S* -1 s.t. 3 j > i + 1 . (u T v w{j) ) 2 > 1 - (168e/A^ (i) ) 2 . (27) 

We actually take i = as the base case, so we can ignore the first assertion, and just observe that 
for i = 0, 

k 

E 1 = f-Y,^f 3 = E. 
i=i 

We have = \\E\\ = e, and therefore the second assertion holds. 

Now fix some i £ [k], and assume as the inductive hypothesis that, with probability at least 
1 — (i — l)rj/k, there exists a permutation tt such that two assertions above hold for i — 1 (call this 
Eventj_i). The i-th call to Algorithm [l] takes as input 

Ti-.= T- E Vf> 
which is intended to be an approximation to 

T i : =E A -(i)^or 



Observe that 



T- — T- — E- 

± i ± i — j- 1 i j 



which satisfies the second assertion in the inductive hypothesis. We may write Tj = X)f=i ^^F 3 
where A; = A; whenever 7T" 1 (Z) > and A; = whenever 7r — 1 (Z) < i — 1 . This form is u sed when 



referring to T or th e Aj in preceding lemmas (in particular, Lemma B.l and Lemma B.4). 

By Lemma B.l , with conditional probability at least 1 — r\jk given Eventj_i, at least one of 6^ 



for r G [L] is 7-separated relative to vr(j max ), where j max := argmaxj>j A^q), (for 7 = 0.01; call this 



Event^; note that the application of Lemma B.l determines C3). Therefore Pr[Eventj_i n Event^] 
Pr[Event^|Eventj_i] Pr[Eventj_i] > (1 — i]/k)(l — {i — l)r]/k) > 1 — irj/k. It remains to show that 
Eventi_i n Event^ C Events so henceforth we condition on Eventi_i n Event^. 

Set 

d := min{(56 • 9 • 102) -1 , (100 • 168)~\ A' from Lemma EU with A = 1/50} . (28) 
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For all r 6 [L] such that 0^ is 7-separated relative to vr(j max ), we have (i) |f?2«,ol > 1/Vk, and 
(ii) that by Lemma B.4 (using e/p := 2e, k := 1, and i* := 7r(j max ), and providing C2), 

l^jv > ^ > ^ ) ~ A vr(i max )l < 5e 

(notice by definition that 7 > 1/100 implies 70 > 1-/(1 + 7) > 1/101, thus it follows from the 
bounds on the other quantities that e = 2pe < 56Ci • < 2 (i+8k) ' Amin ' ®i* as necessary). 
Therefore On ■= N must satisfy 

Wn, On, n ) = maxT^, 6%\ Q% ] ) > max - 5e = A, 0max) - 5e. 
On the other hand, by the triangle inequality, 

Ti(0N,ON,ON) < y~]K(j)Qnm.N + l-^i(%J Qn,0n)\ 
j>i 

^ X] ^(i) I ^OO.AT I ^(i),AT + 56e 

- A 7r(j*)l#7r(j*),Ar| + 5 6e 

where j* := argmaxj>j An-fj) | ^7r(j),iV I • Therefore 

4 

A 7r(j*)|07r(i*),Ar| > A vr(j max ) - 5e- 56e > -A^^). 
Squaring both sides and using the fact that n + ^(j) TV — ^ or an y ^ 

( A ^(i*)^(j*),jv) 2 > ^( A ^(i m ax)^(i*),7v) 2 + ^( A ^(i m a X )^(i),w) 2 



- 25 (^O'*)^^*).^) 2 + 25(^(3)^(1),^) 2 

which in turn implies 



3 

K(j)K(j),N\ < 4 A 7r(j*)l^7r(j*),Af|, 3 + f- 

This means that On is (l/4)-separated relative to Also, observe that 

| fl I ^ 4 A ^(jmax) . 4 A^^) 5 

5 A^y.) 5 A^-.j 4 



Therefore by Lemma B.4 (using e/p := 2e, 7 := 1/4, and « := 5/4), executing another N power 
iterations starting from On gives a vector that satisfies 

86" 

\\0 - Vn(j*)\\ < , |A- < 5e - 

-Mi*) 

Since iii = and Aj = A, the first assertion of the inductive hypothesis is satisfied, as we can modify 
the permutation ir by swapping ir(i) and 7r(j*) without affecting the values of {tt(j) : j ' < i — 1} 
(recall j* > z). 
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We now argue that Ei+i has the required properties to complete the inductive step. By 



Lemma B.5 (using e := 5e and A := 1/50, the latter providing one upper bound on C\ as per 
(28)), we have for any unit vector u G 



^)-¥H ('.«>«) 



( 1 \ 1/2 

< I 1/50 + 100 (« T «7r(?)) 2 ) 5e - 55e - 
V i=i ' 



(29) 



Therefore by the triangle inequality, 



< \\E(I,u,u)\\ + 



E(^i)<?)-vf 



(I,u,u) 



< 56e. 



Thus the bound (26) holds. 



To prove that (27) holds, pick any unit vector u *E S k 1 such that there exists j' > i + 1 with 
(n T ti 7r ( J v)) 2 > 1 — (leSe/A^j/)) 2 . We have (via the second bound on C\ in (28) and the corresponding 
assumed bound e < C\ 



100^(uX (j) ) 2 < 100(l - (« T % v)) 2 ) < lOO^^) 



< 



50 : 



and therefore 



i x 1/2 

1/50 + 100 5^(« T « w (j-)) 2 ) 5e< (1/50 + l/50) 1/2 5e < e. 

5=1 ' 



By the triangle inequality, we have ||£ , j+i(/, u, u)\\ < 2e. Therefore (27) holds, so the second 
assertion of the inductive hypothesis holds. Thus Eventj_i n Event'- C Event,;, and Pr[Eventj] > 
Pr[Eventi_i n Event^] > 1 — irj/k. We conclude that by the induction principle, there exists a 
permutation tt such that two assertions hold for i = k, with probability at least 1 — rj. 

From the last induction step (i = k), it is also clear from (29) that \\T — 52j=i ^j^f^W — 55e (in 
Eventfc-i n Event';.,). This completes the proof of the theorem. □ 



C Variant of robust power method that uses a stopping condition 

In this section we analyze a variant of Algorithm [T] that uses a stopping condition. The variant is 
described in Algorithm [2} The key difference is that the inner for- loop is repeated until a stopping 
condition is satisfied (rather than explicitly L times). The stopping condition ensures that the 
power iteration is converging to an eigenvector, and it will be satisfied within poly(fc) random 
restarts with high probability. The condition depends on one new quantity, r, which should be set 
to r := - # deflation steps so far (i.e., the first call to Algorithm [2] uses r = k, the second call 
uses r = k — 1, and so on). 

C.l Stopping condition analysis 

For a matrix A, we use ||^4||f := (52 i j ^-1 j) 1 ^ 2 to denote its Frobenius norm. For a third-order 
tensor A, we use \\A\\ F := (£, \\A(I, /,'e i )|||) 1 / 2 = (£. \\A(I, I, OH!) 1 / 2 . 
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Algorithm 2 Robust tensor power method with stopping condition 

input symmetric tensor T G M fcxfcxfc 5 number of iterations N, expected rank r. 
output the estimated eigenvector /eigenvalue pair; the deflated tensor. 
1: repeat 

Draw #o uniformly at random from the unit sphere in 
for t = 1 to JV do 

Compute power iteration update 



f(i,e t -i,Qt-i) 

\\f(I,O t -i,0t-i) 



5: end for 

6: until the following stopping condition is satisfied: 



\T{f).x.0y.0.x)\ > niax<j -^\\f\\ F , ^\\f(I,I,9 N )l 



(30) 



7: Do N power iteration updates (30) starting from 9n to obtain 6, and set A := T(9, 9, 9) 
8: return the estimated eigenvector /eigenvalue pair (9, A); the deflated tensor T — A 9® 3 . 



Define T as before in (11): 



i=i 



We assume E is a symmetric tensor such that, for some constant p > 1, 
\\E(I,u,u)\\<e, Vu€S* -1 ; 

\\E(I,u,u)\\ < e/p, Vu G S^ 1 s.t. {u T Vl ) 2 > 1 - (3e/Ai) 2 ; 



Assume that not all Aj are zero, and define 

A min := min{Ai : i G [k], Aj > 0}, 

£:=\{iE[k]:\i>0}\, 



Amax := max{\ : i G [fc]}, 
A« 



v avg 



1 fc \ 1/2 
* i=l 



We show in Lemma C.l that if the stoppi ng co ndition is satisfied by a vector 9, then it must be 
close to an eigenvector of T. Then in Lemma C.2, we show that the stopping condition is satisfied 
by 9n when 9q is a good starting point (as per the conditions of Lemma B.4). 

Lemma C.l. Fix any vector 9 = Yli=i ®i v i> an d ^ ^* := ar g max ie[fc] Assume that £ > 1 

and that for some a G (0, 1/20) and (3 > 2a /Vk, 



e < a ■ — — , ep 



Vk 



< Vi 



n 



f3y/k 



\avg- 
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// the stopping condition 



/ ^ llfll 1 



(31) 



|T(0, 0,0)1 > max 

holds, then 

1. Aj* > /3A avg /2 and Aj*|0j*| > 0; 

2. maxi-^i* \i\8i\ < \fla ■ \i* \0i* \; 

3. 0i*>l- 2a. 

Proof. Without loss of generality, assume i* = 1. First, we claim that Ai|0i| > 0. By the triangle 
inequality, 

k k 

\f(9, e,6)\<J2 + \E{e, e,e)\<J2 M0i\ef + 1 < xm + 1. 



i=i 



Moreover, 



\T\\f> 



i=l 

(k k 
J2~XiVivj( 
.7=1 i=\ 



k 

£ 



X jVjVj 



Vi Vj) 
2 \ 1/2 



2 n 1/2 



1/2 



- \\E\ 



\;=i 

> V^Aavg - 

By assumption, |T(0, 0, 0)| > (/3/VI)\\f\\ F , so 

Ai|6»i| > /3A avg - -^=e F - e > /?A avg - /3 



1 a 



2 ^ 



a 

~7k 



Aavg r— Amin ^ ~ A 



^A 
2 Aav s 



where the second inequality follows from the assumptions on e and e^. Since /3 > 0, A avg > 0, and 
1 6i | < 1, it follows that 

Ai > ^-A avg , Ai|6*i| > 0. 

This proves the first claim. 

Now we prove the second claim. Define M := T(I, I, 6) = X^!=i Xi@i v i v J + 1, 6) (a sym- 
metric k x k matrix), and consider its eigenvalue decomposition 



M = 4>iUiuJ 



i=i 
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where, without loss of generality, \<j)i\ > \4>2\ > • • • > \<ftk\ an d {u±,U2, ■ ■ ■ , is an orthonormal 
basis. Let M := Yli=i^i v i v h so M = M + E(I,I,9). Note that the \i\9i\ and \(f>i\ are the 
singular values of M and M, respectively. We now show that the assumption on \T(9, 9, 9)\ implies 
that almost all of the energy in M is contained in its top singular component. 
By Weyl's theorem, 

|0i| < Ai|0i| + ||M - Af || < \i\6i\ + e. 
Next, observe that the assumption \\T(I, I, 9)\\f < (1 + a)T(9, 9, 9) is equivalent to (l + a)9 T M9 > 



||M||_p. Therefore, using the fact that \q 
fact ||^4||f < V^H^lll for any matrix A G 



ikxk 



max ngS .fc_i \u T Mu\, the triangle inequality, and the 



[1 + a)\(f>i\ > (1 + a)9 T M9 > \\M\\ F 



(32) 



> 



y] K9iViv 

i=l 

k s 1/2 

=1 

k N 1/2 

2/i2 

1=1 



|£(/,/,f 



> (E^ 2 ) -^( J > J > 

i=l 

E* 2 



fee. 



Combining these bounds on \<j)i\ gives 

Ai|0i| + e> 



1 



1 + a 



l< N 1/2 _ 

[2 ° 2 ^ -Vke 



E* 2 

%=i 



(33) 



1/2 



The assumption e < aX mm /Vk implies that 

v^e < aA min < a^E ^l) 
N=l ' 

Moreover, since Ai|#i| > (by the first claim) and Ai|#i| = max i6 ^ Aj|#j|, it follows that 



so we also have 



Ai|#i| > Aminmaxl^l > 

ie[k] Vk 



e < aAi|#i| 



(34) 



Applying these bounds on e to (33), we obtain 

k 



1 + a) 



^' 2 > ' ° .( At^ • max A 2 " 



i=l 



1 + a 



1/2 



which in turn implies (for a € (0, 1/20)) 



maxA 2 ^ 2 < 



|l±^-l).A 2 2 <7o.A 2 2 . 
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Therefore maxj^i \i\6i\ < v7a • Ai|#i|, proving the second claim. 

Now we prove the final claim. This is done by (i) showing that has a large projection onto 
u±, (ii) using an SVD perturbation argument to show that ±u\ is close to v\, and (iii) concluding 
that has a large projection onto v\. 

We begin by showing that (uJ6) 2 is large. Observe that from (32), we have (1 + a) 2 (/) 2 > 
\\MWp > 4>\ + maxj^x and therefore 



max | fa | < \Jlot + a 2 ■ \(f>i\. 
Moreover, by the triangle inequality, 

k 

\9 T M8\ <Y\(/)i\(uj6) 2 < |0i|K^) 2 + max|^|(l- «0) 2 ) = «0) 2 (|<£i| - max |&|) + max |^| 

i=l 



Using (j32J) once more, we have \6 T M9\ > \\M\\ F /(1 + a) > |0i|/(l + a), so 
l 

( u l & > U^- 1 "- 77 777Y ^ 1 



1-max^jM ( i + a )(i_ maXi ^J|il) ~ (l + a)(l-V2o7+^)- 

Now we show that (ujvi) 2 is also large. By the second claim, the assumption on e, and (34), 
Ai|0i| - maxAi|^| > (1 - Vfa) ■ Ai|0i| > (1 - Vfa) • A min /Vfc. 

Combining this with Weyl's theorem gives 

| (pi | — max Ai| 6i \ > \\\6i\ — e — max Aj|#j| > (1 — (a + V7a)) • \ m m/Vk, 

so we may apply Wedin's theorem to obtain 

\\(pi\ - maxj^i \i\Vi\J \l - (a + \/7a); 

It remains to show that 6\ = vJ9 is large. Indeed, by the triangle inequality, Cauchy-Schwarz, and 
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the above inequalities on (ujvi) 2 and (uJ9) 2 , 

k 



\vje\ 



8=1 



> 



K«iIK0|-^kt Wi ||^| 



i=2 



/ k \l/2/k v 

= K^ilK^I - ((i - K T ^i) 2 ) (i - K T #) 2 )) 



1/2 



> 1 



(l + a)(l- \/2a + a 2 ) 



a 



a 



a \ 

(1 + a)(l- V2a + a 2 ) \\-{a + VTa)J 



1 — (a + v7a) 

2\ V2 



> 1 - 2a 

for a £ (0, 1/20). Moreover, by assumption we have T(9, 9, 9) > 0, and 



f (9,9,9) = ^^ + £(9,9,9) 
i=i 

= Ai0? + ^A> 3 + E{9,e,6) 



i=2 



< Ai0f + maxAi|0i| V^ + e 

i/i 

^ i=2 



< Ai6»J + V7aAi|0i|(l - 6»f) + e (by the second claim) 



< Axl^l 3 sign(0!) + 



7a 



(l-2a) 2 



x 7a + 



a 



(i-2 a y 



1/2 



(since |0i| > 1 - 2a) 



< Ai|6>i| 3 (sign(0i) + 1^ 

so sign($i) > —1, meaning 9\ > 0. Therefore 9\ = \6\\ > 1 — 2a. This proves the final claim. □ 
Lemma C.2. Fix a, (3 £ (0, 1). Assume Aj* = max ig [ fc ] Aj and 



e < min 



I — 7= 1— -= — 1-Aj*. £f < VI ■ ^-r~ • Ai* 



15^+7' 7 



2/3 



To the conclusion of Lemma B.4, it can be added that the stopping condition (31) is satisfied by 
9 = 9 t . 
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Proof. Without loss of generality, assume i* = 1. By the triangle inequality and Cauchy-Schwarz, 



/ \ 1/2 

\\f(i,i,e t )\\ F < Xi\e ljt \ + J2 x ^\ + WJ,e t )\\ F < x 1 \e 1 , t \ + x x Vk[ J2 d t \t) + Vfc. 



< X 



id,* 



3Vke 
P 



+ Vke. 



where the last step uses the fact that B\ t > 1 — (3e/(pAi)) 2 . Moreover, 

2 



r(0 t ,0*,0 t )>Ai- 27 



pAi 



+ 2 



Combining these two inequalities with the assumption on e implies that 

1 



T(9 t ,9 t ,e t ) > 



-\\T(i,i,e t )\\ F . 



l + a 

Using the definition of the tensor Frobenius norm, we have 

k 



1 IITII < 1 
-p i If S —p 

VI VI 



i=i 



+ -^f\\E\\f 

F VI 



1 



1 



Combining this with the above inequality implies 



T(I,I,9 t )>^=\\T\\ F . 



Therefore the stopping condition (31) is satisfied. □ 
C.2 Sketch of analysis of Algorithm [2] 

The analysis of Algorithm [2] is very similar to the proof of Theorem 5.1 for Algorithm [TJ so here 
we just sketch the essential differences. 

First, the guarantee afforded to Algorithm [2] is somewhat different than Theorem 5.1. Specifi- 
cally, it is of the following form: (i) under appropriate conditions, upon termination, the algorithm 
returns an accurate decomposition, and (ii) the algorithm terminates after poly(/c) random restarts 
with high probability. 

The conditions on e and N are the same (but for possibly different universal constants C%, C2). 
In Lemma C.l and Lemma C.2 there is reference to a condition on the Frobenius norm of E, but 
we may use the inequality H-EHf < ^11^ II < ^ e so that the condition is subsumed by the e condition. 

Now we outline the differences relative to the proof of Theorem |5.1| The basic structure of the 
induction argument is the same. In the induction step, we argue that (i) if the stopping condition 
is satisfied, then by Lemma C.l (with a = 0.05 and /3 = 1/2), we have a vector 9n such that, for 
some j* > i, 

1. A^-*) > A^ 0max) /(4 v / fc); 



2. 6^ is (l/4)-separated relative to 
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3. 6^, )iN > 4/5; 



and (ii) the stopping condition is satisfied within poly(fc) random restarts (via Lemma B.l and 



Lemma C.2) with high probability. We now invoke Lemma B. 4 to argue that executing another N 



power iterations starting from 6j\f gives a vector 9 that satisfies 



Mi*) I 



< 



8e 



A 



I A - K{j*)\ < 5e. 



The main difference here, relative to the proof of Theorem 5.1 is that we use k := Ay/k (rather 
than k = O(l)), but this ultimately leads to the same guarantee after taking into consideration 
the condition e < CiX m { n /k. The remainder of the analysis is essentially the same as the proof of 
Theorem 15.11 



D Simultaneous diagonalization for tensor decomposition 

As discussed in the introduction, another standard approach to certain tensor decomposition prob- 
lems is to simultaneously diagonalize a collection of similar matrices obtained from the given tensor. 
We now examine this approach in the context of our latent variable models, where 

k 

M 2 = Hi® Hi 

i=i 

k 

M 3 = fli ® Hi ® (M- 

i=l 

Let V := [hi\H2\ ■ • • |^fc] and ^K 7 ?) '■= diag^r/, fj£i], . . .,/i^r/), so 

Mi = V diag(wi, W2, ■ ■ ■ Wk)V T 
M 3 (I,I,r)) = Vdiag(w 1 ,w 2 ,...w k )D(ri)V T 

Thus, the problem of determining the Hi can De cas t as a simultaneous diagonalization problem: 
find a matrix X such that X r M 2 X and X T M^(I, I, rj)X (for all rf) are diagonal. It is easy to see 
that if the Hi are linearly independent, then the solution X T = V* is unique up to permutation 
and rescaling of the columns. 

With exact moments, a simple approach is as follows. Assume for simplicity that d = k, and 
define 

M( V ) := Mz(I,I,ri)M 2 l = VD{ri)V~ x . 

Observe that if the diagonal entries of D{r]) are distinct, then the eigenvectors of M(rj) are the 
columns of V (up to permutation and scaling). This criterion is satisfied almost surely when r\ is 
chosen randomly from a continuous distribution over 

The above technique (or some variant thereof) was used in }MR06l IAHK121 IAFH+121 IHK12] 
to give the efficient learnability results, where the computational and sample complexity bounds 
were polynomial in relevant parameters of the problem, including the rank parameter k. However, 
the specific polynomial dependence on k was rather large due to the need for the diagonal entries 
of D(rj) to be well-separated. This is because with finite samples, M{rj) is only known up to some 
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perturbation, and thus the sample complexity bound depends inversely in (some polynomial of) 
the separation of the diagonal entries of D(rj). With rj drawn uniformly at random from the unit 
sphere in the separation was only guaranteed to be roughly l//c 2,5 [AHK12J (while this may 
be a loose estimate, the instability is observed in practice). In contrast, using the tensor power 
method to approximately recover V (and hence the model parameters jii and Wi) requires only a 
mild, lower-order dependence on k. 

It should be noted, however, that the use of a single random choice of rj is quite restrictive, and it 
is easy to see that a simultaneous diagonalization of M(rj) for several choices of 77 can be beneficial. 
While the uniqueness of the eigendecomposition of M(rf) is only guaranteed when the diagonal 
entries of D(rj) are distinct, the simultaneous diagonalization of M(j]^>), M(rj^'), . . . , M(?/ m )) for 
vectors t/ 1 ), rf 2 \ . . . , r/"^ is unique as long as the columns of 



(1) 

(2) 



^2 V 
H2V 



(1) 
(2) 



(1) 
(2) 



Mfc?? (m) . 



are distinct (i.e., for each pair of column indices there exists a row index r such that the (r, i)-th 
and (r, j)-th entries are distinct). This is a much weaker requirement for uniqueness, and therefore 
may translate to an improved perturbation analysis. In fact, using the techniques discussed in 



Section 4.3, we may even reduce the problem to an orthogonal simultaneous diagonalization, which 
may be easier to obtain. Furthermore, a number of robust numerical methods for (approximately) 
simultaneously diagonalizing collections of matrices have been proposed and used successfully in the 
literature (e.g., [BGBM931 ICSMl ICar94l ICC96I IZLNM04] ). Another alternative and a more stable 
approach compared to full diagonalization is a Schur-like method which finds a unitary matrix 
U which simultaneously triangularizes the respective matrices |CGT97j . It is an interesting open 
question whether these techniques can yield similar improved learnability results and also enjoy the 
attractive computational properties of the tensor power method. 
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